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Abstract 

Undirected graphs can be used to describe matrix variate distributions. In this paper, we develop new methods for 
estimating the graphical structures and underlying parameters, namely, the row and column covariance and inverse 
covariance matrices from the matrix variate data. Under sparsity conditions, we show that one is able to recover the 
graphs and covariance matrices with a single random matrix from the matrix variate normal distribution. We establish 
consistency and obtain the rates of convergence in the operator and the Frobenius norm. We then extend this analysis to 
the multiple-sample instances, and show that having replicates will allow one to estimate more complicated graphical 
structures and achieve faster rates of convergence in both norms. We provide simulation evidence showing that we can 
recover graphical structures as well as estimating the precision matrices, as predicted by theory. 

Keywords. Graphical model selection, covariance estimation, inverse covariance estimation, Graphical 
Lasso, Matrix variate normal distribution. 

1 Introduction 

The matrix variate normal model has a long history in psychology and social sciences, and is becoming in- 
creasingly popular in biology and genetics, econometric theory, image and signal processing, and machine 
learning in recent years. In this paper we present a theoretical framework to show that one can estimate the 
covariance and inverse covariance matrices well using only one matrix from the matrix-variate normal dis- 
tribution. The motivation for this problem comes from many applications in statistics and machine learning. 
For example in microarray studies, a single / x m data matrix X represents expression levels for m genes on 
/ microarrays; one needs to find out simultaneously the correlations and partial correlations between genes, 
as well as between microarrays. Another example concerns observations from a spatio-temporal stochastic 
process which can be described with a matrix normal distribution with a separable covariance matrix S®T, 
where typically, S is called spatial covariance, T is called the temporal covariance, and ® is the Kronecker 
product. When the stochastic process is sptial-temporal, some structures can be assumed for one or both of 
the matrices in the Kronecker product. However, typically one has only one observational matrix. 

We call the random matrix X which contains / rows and m columns a single data matrix, or one instance 
from the matrix variate normal distribution. We say that an/xm random matrix X follows a matrix normal 
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distribution with a separable covariance matrix S = A B, which we write 

X fxm ~ Mf, m {M,A mxm ® Bf x f), (1) 

is equivalent to say vec { X } follows a multivariate normal distribution with mean vec { M } and covariance 
E = A <g) B. Here vec{ X} is formed by stacking the columns of X into a vector in R m ^. Intuitively, 
A describes the covariance between columns of X while B describes the covariance between rows of X. 
See Gupta and Varga (1992); Dawid (1981) for characterization and examples. Note that we can only 
estimate A and B up to a scaled factor, as Ar\ ® ^B = A B for any r\ > 0, and hence this will be our goal 
of the paper, and precisely what we mean, when we say we are interested in estimating covariances A and 
B. 

Undirected graphical models are often used to describe high dimensional distributions. We will use such 
descriptions in the present work t o encode structural assumptions on the inverse of the row and column 
covariance matrices. A common structural assumption is that the inverse covariance matrices, also known 
as the precision matrices, are sparse, which means that the number of nonzero entries (sparsity levels) in one 
or both of them are bounded. Under sparsity assumptions, a popular approach to obtain a sparse estimate 
for the precision matrix is given by the ^i-norm regularized maximum-likelihood function, also known as 
the GLasso Yuan and Lin (2007); Friedman et al. (2008); Banerjee et al. (2008); Rothman et al. (2008). 
All these methods and their analysis assume that one is given independent samples and the estimation of A 
or B alone was their primary goal, as they all assume that X has either independent rows or independent 
columns. A direct application of the GLasso estimator to estimate A® B with no regard for its separable 
structure will lead to computational misery, as the cost will become prohibitive for /, m in the order of 100 
(c.f. Section 2.4). A mean-restricted matrix-variate normal model was considered in Allen and Tibshirani 
(2010), where they proposed placing additive penalties on estimated inverse covariance matrices in order 
to obtain regularized row and column covariance/precision matrices. The focus of their work is on missing 
value imputation, for example, when the data is given by the Netflix movie rating data, rather than estimation 
of the graphs or the underlying parameters. We review this and other closely related work in Section 3.2. 

In this work, we take a penalized approach and show from a theoretical point of view, the advantages of 
treating the estimation of covariance matrices A, B, and the graphs corresponding to their inverses simul- 
taneously albeit through separable optimization functions. The key observation and starting point of our 
work is: although A and B are not identifiable given the separable representation as in (1), their correlation 
matrices p(A) and p(B), and the graphical structures corresponding to their inverses are identifiable, and 
can indeed be efficiently estimated for a given X ~ A// )m (0, A (g) B). Moreover, in the mean-zero matrix 
normal model, p(A)^ 1 and p(B)~ x encode the same amount of structural information as A^ 1 and B~ l do, 
in the sense that they share an identical set of non-zero edges. Therefore, we propose estimating the overall 
S = A ® B and its inverse by (1) first estimating correlation matrices p(A) and p(B) (and their inverses) 
simultaneously using a pair of £i-norm penalized estimators for an instance X ~ A// m (0, A ® B), (2) and 
then combining these two estimators with the estimated variances to form an estimator for S. 

Contributions. In this paper, we will answer the following question: how sparse does A^ 1 or B^ 1 need 
to be in order for us to obtain statistical convergence rates in terms of the operator norm and the Frobenius 
norm for estimating A and B (up to a scaled factor) simultaneously with one data matrix XI Toward this 
end, we develop Gemini, a new method for estimating graphical structures, and the underlying parameters A 
and B, in a mean-zero matrix variate normal model. Under suitable assumptions, we establish convergence 
bounds in the operator and the Frobenius norm for estimating both A, B and their inverses. Our estimators 
and convergence analysis extend, with suitable adaptation, to the general setting where n replicates of X 
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are available, for which with all other parameters hold invariant, the rates of convergence for estimating A, 
B, and their inverses will be proportional to ra -1 / 2 . In addition, we provide simulation evidence that we can 
recover graphical structures as well as estimate the precision matrices effectively. 

In summary, we make the following theoretical contributions: (i) consistency and rates of convergence 
in the operator and the Frobenius norm of the covariance matrices and their inverses, (ii) large deviation 
results for the sample correlation estimators which we propose for estimating both the row and column 
correlations given a single matrix or multiple replicates of the matrix-normal data, (hi) conditions that 
guarantee simultaneous estimation of the graphs for both rows and columns. To the best of our knowledge, 
these are the first such results on the matrix-variate normal distributions in the high dimensional setting 
for finite sample instances, by which we mean n < logmax(m, /). Also worthy of mentioning is the 
computational efficiency of our method. The dominating costs involve in estimating p(A)^ 1 and p(B)^ 1 , 
the total cost of which is in the order of 0(/ 3 + m 3 ) for sparse graphs or 0(/ 4 + m 4 ) for general graphs. 

There is no known closed-form solution for the maximum of the likelihood function for the matrix-variate 
normal distribution (cf. (9)). There has been a line of work in the literature which suggested using iterative 
algorithms, namely, the flip-flop methods to estimate the covariance matrix with the Kronecker structure; 
see for example Dutilleul (1999); Lu and Zimmerman (2005); Werner et al. (2008) and references therein. 
In the present work, building upon the baseline Gemini estimators, we also propose a three-step penalized 
variant of the flip-flop algorithms in Section 5. We show that under certain conditions, this approach yields 
improvements upon the baseline Gemini estimators. 

The rest of the paper is organized as follows. In Section 2, we will define our model and the method. 
Section 3 presents the main theoretical results in this paper on estimating A (g> B, as well as discussions 
on our method and results; moreover, we review the related work on covariance estimation and Gaussian 
graphical model selection, to place our work in context. Section 4 provides large deviation inequalities for 
the sample correlation coefficients in approximating the underlying parameters of p(A) and p{B). Moreover, 
convergence rates in the Frobenius norm for estimating the inverse correlation matrices are derived. We 
propose a non-iterative penalized flip-flop algorithm and study its convergence properties in Section 5 and 6. 
Section 7 shows our numerical results. We conclude in Section 8. We place all technical proofs in the 
Appendix. 



1.1 Notation 

For a matrix A, let |^4| max = maxjj \ Aij\ denote the element-wise or entry-wise max norm. Let \A\ 
denote the determinant and tr(A) be the trace of A. Let ip ma , x (A) and ip m i n (A) be the largest and smallest 
eigenvalues, respectively. We use A~ T as a shorthand notation for (A~ l ) T . We write | • |i for the l\ norm of 
a matrix vectorized, i.e., for a matrix \A\\ = ||vec { A}||j = J2i J2j \Aij\- Let k(A) denote the condition 
number for matrix A. We write \A\ l off = \-Aij\> an( ^ write \A\ off for the number of non-zero non- 

diagonal entries in the matrix. We let C be a constant which may change from line to line. For two numbers 
a, b, a A b := min(a, b), and a V b := max(a, b). We write diag(yl) for a diagonal matrix with the same 

diagonal as A. The matrix Frobenius norm is given by ||TV||^ = \jYli Ylj w ij- The operator norm ||W||2 

is given by ip m a,x(WW T ). For a symmetric matrix A, let T(A) = (vij) where Vij = I Qi ^o> where I/a is 
the indicator function. We write a x b if ca < b < Ca for some positive absolute constants c, C which are 
independent of n, /, m or sparsity parameters. 
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Figure 1: Column and row vectors of matrix X. The normalized column vectors x l / ^a\\, . . . , x m / ^/a mm , 
where an > 0,Vi follow a multivariate normal distribution Nj(0,B) while normalized column vectors 
y 1 1 V / &n> j • • • j U I \A/7' where bjj > 0, for Y = X T follow a multivariate normal distribution N m (0, A). 

2 The model and the method 

In the matrix variate normal setting, we aim to estimate the row and column covariance (correlation) ma- 
trices, from which we can obtain an estimate for E. The problem of covariance estimation in the context 
of matrix variate normal distribution is intimately connected to the problem of graphical model selection, 
where the graphs corresponding to the column and the row vectors are determined by the sparsity patterns 
(or the zeros) of B^ 1 and A^ 1 respectively. Graph estimation in this work means precisely the estimation of 
the zeros, as well as the non-zero entries in A~ x and B . We formulate such correspondence precisely in 
Section 2.1. To resolve the dilemma in covariance estimation arising from unequal dimensions between the 
row and the column, a point which we will elaborate more in Section 3.1, we take a penalized approach to 
be defined in Section 2.2. We make connections between the Gemini estimators and the likelihood function 
of the matrix variate normal distribution in Section 2.4. 



2.1 Problem definition: the matrix normal graphical model 

We show in Figure 1 the data matrix X and its column vectors: x 1 , x 2 , . . . , x k , . . . , x m , and row vectors 
y 1 , y 2 , . . . , yf . This notation is followed throughout the rest of the paper. First recall the following definition 
concerning the classical Gaussian graphical model for a random vector. 

Definition 2.1. Let V = (Vi, Vf ) T be a random vector with distribution P, which we represent by an 
undirected graph G = (V, F). The vertex set V := {1, ...,/} has one vertex for each component of the 
vector V. The edge set F consists of pairs (j, k) that are joined by an edge. IfVj is independent ofV^ given 
the other variables, then (j, k) F. When V is Gaussian, missing edges correspond to zeros in the inverse 
covariance matrix. 

Now let V = {1, ...,/} be an index set which enumerates rows of X according to a fixed order. For all 
% = 1, . . . , m, we assign to each variable of a column vector x l exactly one element of the set V by a rule of 
correspondence g : x l — > V such that g(xj) = j,j = 1, . . . , /. The graphs Gj(V, F) constructed for random 
column vectors x l ,i = 1, ... ,m according to Definition 2.1 will share an identical edge set F. Hence 
graphs Gi, . . . , G m are isomorphic and we write G{ ~ Gj,\/i,j. Due to the isomorphism, we use G(V, F) 
to represent the family of graphs G\, . . . , G m . Hence a pair (£, k) which is absent in F encodes conditional 
independence between the I th row and the k th row give all other rows. Similarly, let T = {!,..., m} be 
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the index set which enumerates columns of X according to a fixed order. We use H(T,E) to represent the 
family of graphs Hi, ... , Hj, where Hi is constructed for row vector y l , and H L ~ Hj , Vi, j. Now H (T, E) 
is a graph with adjacency matrix T(H) = T(A^ 1 ) as edges in E encode non-zeros in A^ 1 . And G(V, F) 
is a graph with adjacency matrix T(G) = T(B^ 1 ). The Kronecker product, H ® G, is defined as the graph 
with adjacency matrix T(H) <g> T(G) Weichsel (1962), where clearly missing edges correspond to zeros in 
the inverse covariance A^ 1 (g> JB -1 , and H®G represents the graph of the p-variate Gaussian random vector 
vec { X }, where p = mf. In the present work, we aim to estimate T(H) and T(G) separately. Estimating 
their Kronecker product directly following the classical p-variate Gaussian graphical modeling approach 
will be costly in terms of both computation and the sample requirements. 



2.2 The Gemini estimators 

We start with the one-matrix case, where n = 1, and m and / are allowed to grow with respect to each other. 
The first hurdle we need to deal with, besides the simultaneous row and column correlations, is the fact that 
between the two covariance matrices A and B (as well as their inverses), the one with the higher dimension, 
which contains more canonical parameters, is always left with a smaller number of correlated samples in 
order to achieve its inference tasks. The remedy comes from the following observation. Although ambient 
dimension /, m can not be both bounded by the other unless f = m, the sparsity over non-diagonal entries 
of each precision matrix can be assumed to be bounded by the ambient dimension of the other. Under such 
sparsity assumptions, we first provide a pair of separable regularized estimators 



A = argmin \tv(f(A)A~ 1 ) + logdetvl + X B \A^\ \ (2a) 
B = argmin jtr(f (.B)^ 1 ) + logdet£ + A j4 |-B _1 | loff } (2b) 



for the correlation matrices p(A) = (oy / \Ja>ua>jj) and p(B) = (bij/ ybubjj) , where the input are a pair of 
sample correlation matrices r(^4) and T(B) 

fg(A) := „ [**'f_l and f^B) == J*'* I , (3) 
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and the l\ penalties are imposed on the off-diagonal entries of the inverse correlation estimates. Moreover, 
the penalty parameters A# and A^4, to be defined in Section 3, are chosen to dominate the maximum of 
entry-wise errors for estimating p(A) and p(B) with T(A) and F(B) respectively. Note that the population 
parameters A and B can then be written as follows: 

A ®B := (Wip(A)Wi) ® (W 2 p(B)W 2 ) /(tr(A)tr(B)) 



where W\/ \Jtr{B) = diagram, . . . ^a mm ) and Wij \/tv{A) = diag(\/5ii, • • • y/bJJ). In order to get 
an estimate for A ® B, we multiply each of the two regularized estimators A and B by an estimated weight 
matrix W\ or W 2 respectively, 

Wx = di a g(||x 1 || 2 ,||x 2 || 2 ,...,||x m |l2) = ^g{X T X) 1 ' 2 , (4a) 
W 2 = diag(||y 1 || 2 ,|| 2 / 2 || 2 ,...,||y^|| 2 ) = diag(y T F) 1 / 2 , (4b) 
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where Y = X T . Up to a multiplicative factor tv(B) and ti(A), W± and W% will provide an estimate for 
diag(^l) and diag(i?) respectively; To estimate A B, we compute the Kronecker product of our weighted 
estimators, 



AV)B 



w^wA ® (W2BW2) /\\x\\p 



while adjusting the unknown multiplicative factors tv(B)tv(A) by ||-X"|| F . We restate Theorem 4.1 in case 
n = 1 in Corollary 2.2. 

Corollary 2.2. Let max(m, /) > 2. TTjeTi w/f/j probability at least 1 — max (^ Vi 7^ j, we have for F(A) 
and F(B) are constructed as in (3), 

fy(B)- W (B) 

fy(A)-^-(A) 
||X||| /(tr(A)tr(S)) - 1 



by 



ll x 112 H X 112 



< 



< 



a 



1 — a 

P 
1-/3 



+ \Pij( B )\ 7 

+ \piM)\ 



a 



a 



ft 



1-/3 



< a A /3, where 



a 



on P F logmax(m, /) £ F logmax(m, /) 

20 —T7 , and p = 20 — 

tr(A) tr(J3) 



2.3 Gemini for replicates of X 

We now adapt the Gemini estimators as defined in Section 2.2 to the general setting where we have multiple 
replicates of X. Suppose that we have n independently and identically distributed matrices X(l) , . . . , X(n) ^ 
Nf m (0, A <8> B). For each t, we denote by 

X{t) = [x{tfx{tf . . . x(t) m ] = [y{t) l y{tf . . . y(t)f] T 

the matrix Xf Xm (t) with x(i) 1 , . . . , x{t) m £ R/ being its columns vectors and y l (t), 
row vectors. 



(5) 

, yf(t) being its 



First, we update our sample correlation matrices: 

Tij(A) := 

r<i(s) := 



ztiMm\yztiMty\\l 

Et=i(y(ty,y(t) j ) 



(6a) 



(6b) 



Et=i\\y(mUEt=i\\y(ty\\ 2 



We will show in Theorem 4.1 large deviation bounds for estimating the correlation coefficients in p(A) and 
p(B) with entries in sample correlation F(A) and F(B) constructed above. Plugging these updated F(A) 
and F(B) in (2a) and (2b) will return us the penalized correlation estimators A and B. Next, we update the 
weight matrices W\ and W2 as follows: 



Wi 
W 2 



n? 



(7a) 
(7b) 
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We can then construct an estimator for A ® B as before, 

A®B := (WiAW^ ® (W 2 BW 2 ) / (^Et=i W X (*)IIf) • (8) 
In Section 3, we state the convergence rates for estimating A ® 5 and its inverse with A ® B and A ® Z? . 

2.4 Discussion 

As mentioned, these optimization goals in (2a) and (2b) are intimately connected to the likelihood function 
of the matrix variate normal distribution. Let X n = (X(l), . . . , X(n)), where X(i) ~ A// iTn (M, A ® £?), 
we have up to an additive constant, 

-2logp(X n \M,A,B) = ti ((A' 1 ® B- 1 ) S n } +f log \A\ + mlog|fl| 

where S n = ^ E?=i vec { ^(t) — M } vec { X(t) — M } T is the sample covariance matrix for vec { X }. 
We assume that M = 0. There is no known closed-form solution for the maximum of the likelihood 
function for the matrix- variate normal distribution. Essentially the flip-flop methods Dutilleul (1999); Lu 
and Zimmerman (2005) couple the estimation for A and B by feeding the current estimate for either of the 
two into the likelihood function (or the penalized variants to be defined) in order to optimize it with respect 
to the other. Upon initialization of A with an identity matrix, they obtain the MLE for A and B by solving 
the following two equations alternately and iteratively 

1 " 1 n 

B{A) = — J2 X^A^Xitf, A(B) = — f Yl X{t) T B- l X{t) (9) 
t=i J t=i 

such that the corresponding output B, or A becomes the input as B, or A to the RHS of equations in (9); 
this process repeats until certain convergence criteria are reached. 

In the present work, we simultaneously optimize a pair of convex functions (2a) and (2b), which can be seen 
as a single-step approximation of a penalized version of (9), where we set both B and A on the RHS of 
equations in (9) to be the identity matrix. To see this, we note that 

f(A) = diag(2(/))- 1 / 2 l(I)diag(l(/))- 1 /2 
and f(B) = diag( J B(/))- 1 / 2 J B(/)diag( J B(/))- 1 / 2 

where T(A), T(B) are the sample correlation matrices as defined in (6a) and (6b), A(I) = i Ylt=i J Sjfe=i y(t) k( 3 
y{t) k are the average of diagonal blocks in S n , which is the sample covariance matrix for vec { X T ), and 
B(I) = I Et=i h EfcLi ® ^) fc ^ the avera 8 e of those in S n . 

Closest to our methods is that of Allen and Tibshirani (2010). They aim to optimize C(A, B) := tr((yl _1 ® 
B~ 1 )S n ) + mlogdetB + / log det A + Ail^li + A 2 | J B _1 | 1 over both A and B, where | • |i penalty 
can also be replaced with the Frobenius norm penalty instead. They argue that using two separate penalty 
parameters or even two kinds of penalties gives greater flexibility in terms of separately modeling the row 
and column covariances. A popular method for minimizing the objective C(A, B) is the block coordinate 
descent method Tseng (2001). Following essentially the arguments in Tseng (2001), they show that while 
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block coordinate-wise maximization reaches a stationary point, it is not guaranteed to reach the global 
maximum of the bi-convex optimization function of A^ 1 and B^ 1 . 

In contrast, in Gemini, a unique pair of optimal solutions A and B can be obtained via the GLasso algorithm: 
Upon computing the sample correlation matrices T(A) and T(B) as in (6a) and (6b), one can then rely 
on publicly available software such as the glasso package in R, which implements the graphical Lasso 
algorithm Friedman et al. (2008) to solve (2a) and (2b). The advantage of estimating the graphs for A^ 1 and 
B^ 1 via separable optimization functions is motivated by both statistical and computational considerations. 
Under sparsity constraints and upon multiplication by proper weight matrices, the penalized estimators A 
and B are strikingly effective in approximating the row and column covariance matrices. See Theorem 4.3 
and Corollary A.3. Our approach is computational efficient, in that, the main cost of our estimators involves 
solving two GLasso problems, the total computational cost of which is in the order of 0(/ 3 + m 3 ) for 
sparse graphs or 0(/ 4 + m 4 ) for general graphs, while solving a global GLasso program in general will 
cost between 0(/ 3 m 3 ) and 0(/ 4 m 4 ). This will become prohibitive for /, m in the order of 100. 

We use theoretical analysis to illustrate the advantage of choosing separate penalty parameters on the es- 
timated inverse correlations and covariances. In practice, we use cross-validation to select the penalty pa- 
rameters as we will show in our numerical examples. It is conceivable that the iterative methods are much 
more computational demanding in that it requires cross-validation at each iteration. Hence we study a non- 
iterative penalized flip-flop algorithm and present its convergence properties in Section 5 and 6. 

3 Theoretical results 

In this section, we present in Theorem 3.1 and Theorem 3.2 the convergence rates for estimating the row 
and column covariance matrices and their inverses with respect to the operator norm and the Frobenius 
norm respectively. Our analysis is non-asymptotic in nature; however, we first formulate our results from an 
asymptotic point of view for simplicity. To do so, we consider an array of matrix variate normal data 

X(l),...,X(n)ii.d.~JV>, m (0,^o®B ) ) n = l,2,... (10) 

where /, m, | Aq 1 1 off , and | Bq 1 1 off , which are the number of non-zero non-diagonal entries in the inverse 
covariance matrices, may change with n. 

We make the following assumptions. 

(Al) The dimensions of /, m are allowed to grow with respect to each other and for d = 1 — (c/2 A 1/2) 

WhereC = loglSv/) > 

K'lo.off = o(nf/\og 2d {mVf)) (/,m^oo), 
and l^lo.off = o(nm/\og 2d (my /)) (f,m -too). 

(A2) The eigenvalues ipi(Ao),(pj(Bo),Vi,j of the positive definite covariance matrices Aq and Bq are 
bounded away from and +oo. 

For n = 1, (Al) implies that, up to a logarithmic factor, the number of non-zero off-diagonal entries in 
p{Aq)~ 1 or p(Bq)^ 1 must be bounded by the dimension of the other matrix. Let a m - m = minj Aq^u and 
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&min = minj Bo,**- By positive-definiteness of An and Bn, we have a m ; n > and & m ; n > 0. Then 
clearly for a max = maxj ,4 ,m and 6 max = maXi.Bo,ii, we have < a min < a max < \\A \\ 2 < +oo and 

< 6min < Vax < ||-Bo|| 2 < +°° on ( A2 )- 

We now state the main results of this paper. These results are new to the best of our knowledge. These 
bounds are stated in terms of the relative errors; The absolute error bounds are given in Theorem A. 1 and 
Theorem A. 2. The proofs for Theorems 3.1 and 3.2 appear in Sections A.2 and A.3. 
Theorem 3.1. Consider data generating random matrices as in ( 10) and assume that (Al ) and (A2) hold. 
Let c 



log n 



log log(mv/) anc ^ d = 1 — (c/2 A 1/2). Suppose the penalty parameters are chosen to be 

HA)||_f log d max(m, /) log d max(m, /) 



and \b = Afj 



tr(^4o) \Jmn 
H-Bolli? log d max(m, /) log rf max(m, /) 



tr(i?o) \/n \fjn 

Then with probability at least 1 — j^x , for A® B as defined in (8), 

< 1 1 ^4_o 1 1 2 H-B0II2 <5> and 



0, 
0. 



max(m,/) 2 
A^B - Aq <g> Bn 



A®B 



-1 



A 



-1 



B, 



-1 



< LB, 



-II 



^0 1 1 1 2 ' wnere 



5, 5' >c log rf max(m, /) 




o(l). 



Theorem 3.2. Consider data generating random matrices as in (10). Let \a and Xb be chosen as in 
Theorem 3.1. Let A® B be as defined in (8). Under Assumptions (Al ) and (A2), we have 



A®B-A (g>B < 5 \\Aq\\ w \\B \\p where 

F 

6 = O (Aso^l^'lcoff v m /y/™ + x Aoyf\Bo\ og V 7/V?) = o(l). 



In particular, suppose (1) 1 < n < log(m V /) Or (2) \A n \ 0oS = 0(m) and \ B \ 0oS 
under (Al ) and (A2), we have 5 x \a + A^ and 



A <g> B - A ® B 



<5\\A \\ F \\B \\ F 



log d (m V /) 



The same conclusions hold for the inverse estimate 

-1 



A®B 



A' 1 
^0 



B, 



-1 




<d\\A^\ 



Lb, 



n 



(11) 

0(f). Then 
(12) 

(13) 



with 5 being bounded in the same order as above for each case. 

Remark 3.3. Let us examine the rate of (11) for some other cases. Suppose that \Aq 
[Bq 1 | q oS > f and n > log(m V /) is moderately large. Then under (Al ) 



-i| 



lo.off 



6 = { X B ^\A \ oS /m + X Ao ^/\B \ oS /f) 



1 j 1 

v7 V™ 



> m and 



(14) 



and 



A® B — An® Br 



<5\\A \\ F \\B \ 



o K/m+ V f 
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where we assume that both \A^ 1 1 off and \Bq 1 | off are allowed to grow linearly with n in the worst case, 
and hence the bound on 5 becomes independent of n. We note that for the rate we calculated in (14), we 
assume that n is not too large; otherwise, one can refine the calculations a bit further. 

For example, suppose that n > logmax(m, /) V —f-\ Then (Al) becomes vacuous and trivially, we 
have for d = 1/2, 



O 



m + / log 1 / 2 max(m, /) 
yjjrh 



n 



Hence when n X logmax(m, /) ( ~ V ^M> we have 5 = o ^^7= A ^ 



A®B — An 



< S \\A 



o\\f II-^oIIf 



and 

o(V f A y/m) 



We do not pursue such refinements in this work. 



3.1 Discussion 

To put our discussions on the rates of convergence for covariance estimation in context, we first present an 
example from the classical multivariate analysis. Consider the case where we are given a single sample from 
the matrix variate normal distribution with Bq = I, and the dimensions /, m increase to infinity, while the 
aspect ratio f/m— >• const > 1. 

The classical multivariate analysis focuses on estimating Aq using data matrix X whose rows are indepen- 
dent replicates following Af m (0, Ao). The simplest way to estimate Aq is to compute the sample covariance 

A f = jX T X = j J2{ =1 Xi (8) xi, where a* ~ M m (0, A ). 

The problem here is to determine the minimal number of independent rows we need so that the sample 
covariance matrix Af approximates A "well" in the operator norm. This concerns the classical "Bai-Yin 
law" in random matrix theory regarding the Wishart random matrix Af = jX T X, which says that the 

spectrum of A t is almost surely contained in the interval [a 2 / f +o(l),b 2 / f +o(l)] where a = (\/J— ^Jm)+ 
and b = y/J + ^fra in case Aq = I. For general covariance matrix Aq, it follows that, 

\\A f - Aoh < fey/m/f + (m/f) + o(l)) \\A \\ 2 (15) 

with high probability. We refer to Vershynin (2012) for a comprehensive exposition on such results. While 
such results provide a satisfactory answer to the covariance estimation problem in the regime / > m for 
general multivariate normal distributions, it remains challenging problems as to: (1) how to estimate the 
covariance matrix which has the larger dimension of the two? that is, how can we approximate Aq well 
in the operator norm when / < ml (2) how to estimate both Aq and Bq given both correlated rows and 
columns? 

Our answer to the first question is to use the penalized methods. The operator norm bound in Theorem 3. 1 
illustrates the point that the combination of sparsity and spectral assumptions as in (Al), (A2), and i\- 
regularization ensures convergence on estimation of the covariance and precision matrices, even though 
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their ambient dimensions may greatly exceed the given sample sizes. In particular, the ambient dimensions 
which appear in the numerator in (15) are replaced with the sparsity parameters: 

M'xlogmax(m,/) \off V 1 + -j^y/\ B o\ oS v = (!)- 



which holds for n = 1 with high probability under A(l) and (A2). We also pay an extra logarithmic factor 
for such dependences between both rows and columns. We note that between Aq and Bq, the dimension 
of one matrix is the same as the number of samples available for estimating parameters in the other matrix. 
For example, the number of examples we have for estimating parameters in matrix Aq using the date matrix 
X as shown in Figure 1 is /, which is the dimension of matrix Bq; and vice versa. The two summands in 
5 and 5' in Theorems 3.1 reflect the rates of convergence in the Frobenius norm for estimating p(Ag) and 
p(Bo) with A and B as in (2a) and (2b) (cf. Theorem 4.3). Moreover, the two summands in 5 and 5' in 
Theorems 3.1 and 3.2 also correspond to the rates of convergence in the operator and the Frobenius norm for 
estimating the row and column co variance matrices Aq, Bq, up to a scale factor, respectively. In particular, 
Corollary A. 3 provides rates of convergence for estimating A*, B* (and their inverses), where 

A* = mAo/tr(A ) and B* = B tr(A )/m, (16) 

in both norms with the following estimators: 

A, = mW 1 AW l /{lY, n i =x\\X{i)f F ) and B, = W 2 BW 2 /m (17) 

where clearly A* (g> B* = A<g> B. 

To answer the second question, first note that 

E [X T X/f] = E [l YLi V* ® V*] = A tr(B )/f 
E [XX T /m] = E [i JXi x i ® x<] = B tv(A )/m, 

which suggest that Af and B m = XX T /m are good starting points for us to construct estimators for p(Aq), 
p(Bq), and their inverses, despite the presence of dependence along the other dimension. We construct more 
sophisticated sample covariance and correlation estimators based on the pair of likelihood functions in (9) 
in Section 5. The relationships between the row correlations and column correlations of X are known to 
complicate the solution to the related problem of testing the hypothesis that microarrays are independent of 
each other given possibly correlated genes. For example, it was observed in Efron (2009) that the gene-wise 
correlation can induce the appearance of microarray-wise correlation through the "leakage" phenomenon in 
the doubly standardized situation for X. We illustrate such interactions between the row-wise and column- 
wise correlations and inverse correlations via the large deviation bounds which we derive in Section 6. 

To understand the rates for n > 1 and to build the connection between the one-matrix and the multiple- 
matrix cases, we imagine stacking matrices X(l), . . . , X(n) on top of each other to form a single nf x m 
matrix X'. This way, we can then imagine being in the situation of one-matrix case: X' ~ A/" n / )m (0, ® 
B'), where the dimension of Aq and B' is m x m and nf x nf respectively. Similar to the general case 
with one data matrix, we will use a sample of size nf to estimate the structure and parameters for Ao; 
however, unlike the one matrix case we have focused on so far, B' has an additional structural property 
other than the assumed sparsity on its inverse which allows for a faster rate of convergence, namely, matrix 
B' is block diagonal with n identical submatrices Bq along its diagonal. Hence the estimation step for B' 
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takes advantage of this knowledge by stacking Y(l), . . . , Y{n) on top of each other, where Y(i) = X(i) T , 
to form a nm x / matrix Y' ~ M nm j(0, Bo ® A'), where matrix A' is in turn block diagonal with n 
identical submatrices Aq staying along its diagonal. We treat Y' as having nm samples, all of which subject 
to normalization come from the same multivariate normal distribution Aff (0, Bo) as shown in Figure 1 . This 
enables faster convergence rates for n > 1 as shown in Theorems 3.1 and 3.2. 

Finally, in our model, we assume that both the rows and columns of X are centered by setting the mean 
matrix M = 0, but we do not assume that they are also scaled. That is, we allow the diagonals of A and B 
to take arbitrary positive real numbers which are bounded away from or +oo as assumed in (A2). It would 
be interesting to study the following mean-restiicted matrix- variate normal model: Xij = u \ + \ij + ey, 
where the two additive fixed effects i/j, uj depend on the row and column means and e ~ -A//,m (0, Aq Bq) 
is a mean-zero random effect. See Bonilla et al. (2008); Yu et al. (2009); Allen and Tibshirani (2010) and 
references therein for applications of such and other random effects models. 

3.2 Related work 

In the classical setting, various work Dutilleul (1999); Lu and Zimmerman (2005); Werner et al. (2008) fo- 
cused on algorithms and convergence properties on estimating £ using a large number of samples X(l), . . . , X(n). 
The flip-flop methods for estimating Aq ® -Bo with a finite number of iterations have been shown to converge 
asymptotically as well as the MLE as n — > oo Werner et al. (2008). Other recent work with an iterative ap- 
proach for solving the graphical model selection problem in the context of matrix variate normal distribution 
include Zhang and Schneider (2010); Yin and Li (2012); Tsiligkaridis et al. (2012). None of these work was 
able to show convergence in the operator norm which works in case n = 1 and /, m — > oo as in our work. 
When /, m diverge as n — > oo, the rates in Yin and Li (2012) are significantly slower than the corresponding 
ones in the present work. In particular, certain non-asymptotic rates in the Frobenius norm are obtained in 
estimating Aq and Bq; However, while estimating Bo or Ao, the ambient dimension of the other matrix, 
namely, Ao or Bo appears in the numerator instead of the denominator. This is undesirable, as we have 
discussed in Section 3. 1 that the effective sample size in estimating Bo and Ao should be nm and nf re- 
spectively. Following essentially the same methods as in Allen and Tibshirani (2010), the same convergence 
rate as in (12) on estimating the covariance £ = Ao <g> Bo in the Frobenius norm is obtained in Tsiligkaridis 
et al. (2012), upon a finite number of iterations in case \ Aq 1 1 off = 0(m), and \Bq X | off = O(f); how- 
ever, this rate is obtained with the additional requirement that the number of replicates of X must be at least 

the order of n > O ( V y J log max(/, m, n)\ . This excludes the case for n = 1 or for n < log(m V /), 
which is the main focus of the present paper. 

High-dimensional covariance (inverse) estimation has been intensively studied in the literature under various 
structural assumptions, see for example Huang et al. (2006); Furrer and Bengtsson (2007); Meinshausen 
and Buhlmann (2006); Lam and Fan (2009); Bickel and Levina (2008); El Karoui (2008); Yuan and Lin 
(2007); Friedman et al. (2008); Banerjee et al. (2008); Rothman et al. (2008); Ravikumar et al. (2008); Peng 
et al. (2009); Yuan (2010); Cai et al. (2010, 2011); Zhou et al. (2011). It is plausible that we can extend 
some of these methods and techniques on sparse (inverse) covariance estimation and apply to the matrix 
variate setting under a different set of assumptions. A nonparametric method for estimating time varying 
graphical structures using independent but not identical samples with provably convergence rates under 
suitable assumptions was studied by Zhou et al. (2008). It would be interesting to consider such models in 
the matrix variate normal setting. 
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4 Estimation of the correlation coefficients 



In this section, we elaborate on two key technical results, namely, the concentration bounds for sample 
correlation estimates and the convergence bounds for the penalized inverse correlation estimates. We use 
these results to prove the main theorems 3. 1 and 3.2. 



4.1 Concentration bounds for sample correlations 



We now show the concentration bounds for estimating the parameters in p(Ao) and p{B$). Theorem 4.1 
covers the "finite" sample setting where the number of replications n being upper bounded by log(m V /). 
We believe these are the first of such results to the best of our knowledge. For completeness, we also state 
the bounds when n is large. The proof appears in Section B.2. 

Theorem 4.1. Let 1 < n < log max(m, /). Consider data generating random matrices as in (10). Let us 
define for c 



log n 



iogiog(mv/) ' where max ( m > /) > 2, and d = l- c/2. Let 
a n := 7TiTo/tr(j4o) where tq = 2Q(\\Aq\\ f / ^fm)\og d m.zx.(m, 



"mn. 



(3 n := K/tr(i?o) where = 20(\\B \\ F /^)log d ma X (mJ)/^. 

, for a n , p n < 1 /3, and T(Aq) and T(Bq) as in (6a) and (6b), 



Then with probability at least 1 
we have 



max(m,/) 2 

Vi/j, Tij(B ) - pij(B ) 



< 



+ \pij(B ) 



a,. 



a. 



and 



Tij(A ) - pij(A c 



< 



fin ,i (a\\ 



< 3a n , 



<3/3 n , 



tr(i4o)tr(B ) < ti(A )ti (B )(a n A (3 n ). 



(18) 



When n > log(m V /), we always set d = 1/2 in tq and Tq, where 20 is replaced with a smaller constant to 
be defined in Theorem C.l. Then all three inequalities immediately above will continue to hold. 

The following large deviation bounds in Lemma 4.2 are the key results in proving Theorem 4.1. We write 
it explicitly to denote by Xq the event that all large deviation inequalities as stated in Lemma 4.2 hold. 
To establish such concentration bounds, we apply decorrelation techniques and the factorial-type moment 
estimate for the Gaussian chaos from Ledoux and Talagrand (1991) to ( x l , x J ') , and ( y l , y 3 ' ) , for each 
which make applications of Bernstein's and Bennet's inequalities Bennett (1962) possible. We leave 
the theoretical formulation and proofs in Sections B and B.3. 

Lemma 4.2. Denote by Xq the event that the following inequalities hold simultaneously for tq, Tq as defined 
in Theorem 4.1: 



Vi / j, 



l 

m I n 



h hU=i(\\y(ty 
iY,U{{y(t)\y(ty) /v?® 



- tr(Ao) < 

-^(S )tr(i4o)| < 

1 2 

\ 2 /On) -tr (Bq) < 

- Pij (A )tv{B )\ < Tq 



TO, 



TO, 



Then P (X ) > 1 



max(m,/) 2 ' 
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4.2 Bounds on estimating the inverse correlation matrices 



In this section, we show explicit non-asymptotic convergence rates in the Frobenius norm for estimating 
p(Aq), p(Bq), and their inverses in Theorem 4.3. 

We say that event T(Aq) holds for sample correlation matrix T(Aq) for some parameter 5 n j — > 0, if for all 

j, fjj(A ) = pjj(A ) = 1 and 

max \T jk (A ) - p jk {A )\ < S n t, (19) 

j,k,j^k 

and the event T(Bq) holds for sample correlation matrix T(Bq) for some parameter 5 n m — > 0, if for all j, 

Tjj(B ) = Pjj(B ) = 1 and 

max \f jk (B ) - p jk {B )\ < 6 nm . (20) 

j,k,jj±k 

Theorem 4.3. Suppose that (Al) and (A2) hold. Let A and B be the unique minimizers defined by (2a) 
and (2a) with sample correlation matrices T(^4o) and T(Bq) as their input. Suppose that event T{Aq) holds 
for T(Ao)for some 5 n j and event T(Bq) holds for T{Bq) for some S nm , such that 



J 



4" 1 ! 
^0 lo.off 



B, 



-li 

lo.off 



V 1 = oil) 



V 1 = o(l) and 5 n 
Set for some < e, e < 1, Xb = 8 n j/e and Xa = <5n,m/e 
Then on event T(A ) D T(B ), we have for C = 9(1 + e)/2 and C = 9(1 + e)/2, 

r-i 



A' 



piAo)- 1 p < CA B ^|Vlo,off V V<in(p(4))), 

lo,off 



B- 1 - p(Bo)- 1 f < C'\ Ay /\Ba\ nff V l/tf n MBo)), 



(21) 

(22) 
(23) 



A-p(A ) <2Ck( P (A )) 2 \ b 



A, 



-ii 

lo.off v x ' 



VI, 



and 



B - p(B ) < 2C' K (p(B )) 2 X A J\B \ oS V 1 



Corollary 4.4 provides a bound on the off-diagonal l\ norm on the error matrices for estimating ©o = 
p(Aq)^ 1 and $o = p(Bq)^ 1 , which may be of independent interests. We use it in our analysis of the 
flip-flip algorithm. 

Corollary 4.4. Suppose conditions in Theorem 4.3 hold, and Xb and Xa are chosen as in (21). Let S = 
{(i, j) : 0, i j} and S c = {(i, j) : &Qij = 0, i ^ j}. Then on T(Aq) as defined in Theorem 4.3, 



IA 



S c ll 



< 



1 + e 
1-e 
1 + e 



A 5 | x where A = A 1 - 9 =: A Ao 



(24) 



and \A Ao \ hoS < y^ 9Xb V I A ° 1 1 o.off V Vl A ° 1 1 0,off /^min (p(Ao ) ) • 



Similarly, for A := A# = B 1 — $o» lA^^ < i±| (A^^, where S := {(i, j) : ^oij 7^ 0, i ^ j} and 
S c := ■ &oij = 0, i 7^ j}, /jo/iis o?i T{Bq), and 



k Soli, off 



< 



1 + e. 



9 WlV| ,off V Vl^Lff/^nMSo)) 



14 



Variants of Theorem 4.3 were shown in Rothman et al. (2008) in the context of Gaussian graphical models. 
We give a complete proof of Theorem 4.3 and Corollary 4.4 in Section D. Lemma 4.5 justifies the choices 
of the penalty parameters Xa and Xb as defined in Theorem 3.1, 
Lemma 4.5. Let a n , (3 n < 1/3 be be as defined in Theorem 4.1. Let 

, 2Pn n logV V f) , - 2a n log V V f) 

Onj = z w~ =■ <^3 and d nm = =: C 4 -j= . 

1 — p n y/nj 1 — a n y/nm 

Then, event T(Aq) Pi T(Bq) holds on X^for the sample correlation matrices as defined in (6a) and (6b) 
respectively. 

By Theorem 4.1, we have F (T(A Q ) n T(B )) > 1 - (m ^ /)2 . 

Remark 4.6. In Theorem 3.1, we obtain the penalized estimators B and A as defined in (2b) and (2a) with 
T(Bq) and T(Aq) constructed as in (6b) and (6a) as the input; the penalty parameters are chosen to be: 



Xa q x Cyilog max(m, /)/y mn and Xb x Cslog max(m, f)/y fn, 
which on event Xq dominate the maximum entry-wise errors \T(Bq) — /o(i?o)|max and |T(Ao) — p(Ao)\ max 

respectively. The values of C A ■= ^}\a^ f = ^^{A^ and Cb := ^(5°)^ = ^^{B^ reflect 
how eigenvalues of each component covariance matrix vary across its entire spectrum. The notation Xa 
and Xb thus reflect their dependencies on the eigenspectrum of Aq and Bq. Under the bounded spectrum 
assumptions in (A2), Ca, Cb are taken to be constants. These choices satisfy the conditions in Theorem 4.3 
in view of Lemma 4.5. 



5 Variations on a theme 



It is curious whether or not one can improve upon the Gemini correlation estimators, which will lead to better 
convergence rates for estimating p(Aq) and p(Bq) with (2a) and (2b). To answer this question, we introduce 
a natural variation of the Gemini estimators as given by the Non-iterative Penalized Flip-Flop algorithm de- 
scribed below. To make our discussion concrete, suppose we aim to estimate A* = (a*,y) = m^o/tr^o) 
and B* = = Botr(Ao)/m instead of Aq and Bq. Note that A* has been normalized to have 

tv(A^) = m for identifiability. Clearly ^4* <g> B* = Aq <g> Bq. Denote the k,£ th block of size / x / 
in S n by Sj* and that of size m x m in S n by S^f. We now describe this procedure. 



Non-iterative Penalized Flip-Flop algorithm (NiPFF) 

1. Assume f < m. Initialize = I. Compute T(Bq) based on (6b) as before, and compute B using 
GLasso (2b) with the penalty parameter to be chosen (cf. Lemma 6.1). 

Let Bi = W 2 BW 2 /m. 

2. Now compute the sample covariance A(Bi) using (9): 

A(B 1 )= 1 7 ZL 1 ELs k n e B7l k ; (25) 
Compute the sample correlation matrix T(^4o) with 

f(A ) = W^MB^Wr 1 where W x = diag(2(#i)) 1/2 . (26) 
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Obtain an estimate A(B\) using GLasso (2a) with T(Aq) in (26) as its input, where Xb = Xb 1 is to be 
specified (cf. Remark 6.3 and Corollary 6.4). 

Let Ai = A* = W^A(Si)Wi. 

3. Compute sample covariance matrix B{A\) using (9): 



Compute the sample correlation matrix T(So) with 

f (S ) = W^B(A X )W^ where wj, := diag(S(Ai)) 1/2 



(27) 



(28) 



Obtain an estimate S(Ai) using (2b), with r(So) in (28) as its input, where Xa = X Al is to be 
specified (cf. Theorem 6.6 and Remark 6.7). 
LetS, = W 2 B(A 1 )W 2 . 



6 Analysis for the penalized Flip-Flop algorithm 

In analyzing the Flip-Flop algorithm, we make the following additional assumption. 
(A3) The inverse correlation matrices have bounded |p(Ao) -1 L and |p(So) -1 ^: 

\p{A Q )-\^m and \p(B )-\^ f. 



We use the following notation throughout this section. Let c 

\og d max(m, /) 



loe n 



X 



f,n 



20- 



and X m ,n = 20 



io g Lo g ^^ dd=1 -(e/2Al/2).Let 
log rf max(m, /) 



(29) 



\fjn \/mn 

and for n > A(Cik 2 + C 2 n) logmax(m, /), where k > 0, d = 1/2 and the constant 20 can be replaced 
with 2Jd + C 2 /k, where C\ = -#= » 5.0088, and C 2 = y/Ee « 7.6885. 

v V D7T 

First we bound the entry-wise errors for the sample covariance and correlation matrices as defined in Step 2 
in Lemma 6. 1 and Theorem 6.2. 

Lemma 6.1. Suppose that (Al), (A2) and (A3) hold. Let B and B\ be obtained as in Step 1, where we 
choose for < e < 2/3, 

= z-, 2 " s > -, a for a = C A X m ^ n where C A = || A || F \/m/tr(A ). 
el — a) 1 — a 



Then on event Ai, for A(Si) as defined in (25) 



A(Sx) - A„ 



where p = Xa 



A/, n (l + o(l)) + K 



S" 



l.off ^ + ] - n 



s 



-1 



//<A* 



/or /x = AaXSq)- 1 ^^// + ^(So)- 1 !, // + o(A Ao ). 

(mV/) a • 



(30) 
(31) 
(32) 



Moreover, we have for some constant d < 8, P (Ai) > 1 
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Theorem 6.2. Suppose all conditions in Lemma 6.1 hold. Let T(Aq) be as defined in (26). Then on event 
Aijorrj := A/ jTl (l + o(l)) + p, Vi ^ j 



< A />w (l + o(l)) 



+ 



, _ >■ ■ \PiMo)\) + \pij(A )\- 

where Jl as defined in (31). Then on event Ai, for p as in (32) 

2'/ 



2/1 



r(A ) - P {A C 



< 



1 — rj 



where rj = Ay jn (l + o(l)) + /i. 



(33) 



(34) 



Remark 6.3. On event A\, the random quantities p and rj are upper bounded by p (32) and r\ (34) respec- 
tively, which can be rewritten as follows. 



Define Cf := \p(B ) 1 \ 1 / f + - \p(B ) \ off /f so that 



-i 



-^-^ (Cf + o(l)) and rj = (A/, n + J^^ff)^ + o(l)), 



which suggests that we set the penalty in Step 2 in the following order, 

2t] ol 



Abi 



(1-r?) 



A/, n + 



1 — a 



Cf x Xf >n + A r 



Clearly Cf x 1 under Assumption (A3). We compare the bound in (33) w/f/i f/m? of Theorem 4.1 in Sec- 
tion 6.1. We note that the conclusions of Lemma 6.1 and Theorem 6.2 continue to hold even if e is chosen 
outside of the interval (0, 2/3], so long it is bounded away from and 1. 

We now compute the rates of convergence in the operator and the Frobenius norm for estimating A* with 
A* in Step 2 in Corollary 6.4. The rates we obtain in Corollary 6.4 correspond to exactly those in Corol- 
lary A.3 for the Gemini estimator, with slightly better leading constants, where we replace Xb 1 with X Bo - 
We compare these two penalty parameters in Section 6. 1. 

Corollary 6.4. Suppose (Al), (A2), and (A3) hold. Assume that rj < 1/4. Suppose that on event A\, we 
choose for rj as defined in Theorem 6.2, 



X Bl = 2rj/(ei(l — rj)), where < e\ < 1; 
Then for A* as constructed in Step 2 and some constant 9 < C < 18, 

A* -A* < 2CX Bl a^ max K (p(A )) 2 y/\A^\ oS V 1, 
A* - A* < 2C \B 1 a* t raa, x K(p(A )) 2 \Aq 1 \ 0oS V m, 



(35) 



-i 



and 



A 



A- 1 - AI 1 



< C\ Bl J\A- 



i-il 



lO.off 



V l/(a* Mn (p 2 min (p(A Q ))), 



where a* 



Next we bound the entry-wise errors for the sample covariance and correlation matrices as defined in Step 3 
in Lemma 6.5 and Theorem 6.6. 
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Lemma 6.5. Suppose that (Al), (A2), and (A3) hold. Let A = A(B\) and A\ be defined as in Step 2. 
Suppose we choose \b 1 as in (35). Then on event A\ D £2, far B(A\) as in (27) and rj as in (34), 



where £ = \g 1 



A m , n (l + o(l)) + 
-1 /_ , V 



A 



I m + 

l,oflf 1 — 7] 

T] 



A 



-1 



jm < £ 



for £ = \ Bl \p(Ao) 1 |i, off / m + \p( A o) \/m + o(\ Bl ). 



(36) 



(37) 



Moreover, we have far some constant d < 10, P (Ai D £2) > 1 — u^jy2 ■ 

Theorem 6.6. Suppose all conditions in Lemma 6.5 hold. Let £ = A mjn (l + o(l)) + £, where £ is as defined 
in (37). Then on event At D £"2, 



rij(-Bo) - Py(-Bo. 



A m , w (l + o(l)) 1 rR ^,C + e 
< + \Pij{ B 0)\ Y 



< 



i-C 

2A m , ra (l + o(l)) 

i-C 



+ 



|Pij(^o)| y 



-c 

2£ 



(38) 
(39) 



6.1 Discussion 



We first compare the bound in (33) with that of Theorem 4.1, where we obtain with probability at least 

1 3 



max(m,/) 2 



Vi/j, Tij(A ) - Pij{A ) 



(40) 



for r(^4o) as defined in (6a), (3 = Cs^f.n, and Cb = \\Bq\\ f v7A r (A))- On the other hand, the influence 



of \a q x izr~ on the entry-wise error for estimating pij(Ao) in (33) is regulated through both Cf, which is 
a bounded constant under (A3) (see Remark 6.3), as well as the magnitude of pij(Ao) itself; to see this, we 
write (33) as follows: Vi / j, 



Fij{A ) - pij(A 



< 



A 



1 



V 



1 + 



2a 



^i(Ao)l) + — C f \ Pij (A )\\ (1 + o(l)) 



We note that this rate is at the same order as that in (40). However, when A m) „ <C \* n , the second term 
YZ^Cf \pij{A§)\ x |pij(Ao)| CfCA^m,n is of smaller order compared to the first term. In this case, the 
upper bound in (33) is dominated by the first term on the RHS, and one can perhaps obtain a slightly better 
bound with Theorem 6.2, as the leading term no longer depends on the constant Cb as displayed in (40). 
For example, suppose that Bq is a diagonal matrix so that p(Bq)^ 1 = I, so that |p(So) _1 L // = 1 and 
Cf = 1. The constant Cb = v7 ||-Bo||.p /tr(Bo) > 1 can still be large due to the variability of the diagonal 
entries of Bq. This will lead to improvements in estimating A* as shown in Corollary 6.4, which remains 
true so long as B~ l is sparse in the sense of (A3). 

We next compare the bound in (38) with that of Theorem 4.1, where we obtain with probability at least 
1 



max(m,/) 2 



, for r(£?o) as defined in (6b), 



rVj(-Bo) - Pij(B 



< 



CaX 



a 



:i + \pij(B )\) 



(41) 
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where a = CA^m,n, and Ca = ||A)||p V^/MA))- Before we proceed, we first define the following 
parameter 



C m = \p(A ) 1 \ 1 /m-\ \p(A ) 1 | l off /m so that 

£ < r— ^ — (Cm + o(l)) and £ < (A m , n + -^-C m )(l + o(l)) 
1 — 7/ 1 — 77 

where < E\ < 1 is the same as in (35). Hence we have on A\ n £2, 



Tij (Bo) _ Pij(Bo 
3 



A m , ra (i + (i)) , , R ,,c + e 

< : : + \Pij{B )\ ~ 



i-C 



< 



- ( A m , n (l + \pij(B )\) + \pij(B )\ C m - 



27/ 



V 



c 

1+0(1)) 



by (38), where 2rj/ (1 — 7/) x A mjn + A/ )Tl and we assume that £ < 1/3. Clearly the influence of A^ x 
on the entry- wise error for estimating pij(Bo) is regulated through the quantity C m which is a constant 
under (A3), as well as the magnitude of pij(Bo) itself. We note that these rates are at the same order as in 
Theorem 4.1 when m X /. Moreover, for pairs of where i ^ j, such that |py(Bo)| is small, one can 
perhaps obtain a slightly better bound with Theorem 6.6, as the first (leading) term no longer depends on 
the constant Ca = y/fn \\Aq\\ f /tr(^o) > 1 as needed in (41). Otherwise, suppose that m S> /. Then the 
original estimator in (6b) could be much better for pairs of with a large \pij(Bo)\, as for such pairs, the 
second term is of larger order than the first term in (39). 

Remark 6.7. Finally, we mention that for the B* which we obtain in Step 3, we achieve the same conver- 
gence bounds as in Corollary A. 3, except that we replace Xa h with A J 4 1 , which will be chosen to dominate 
the maximum entry-wise error: assuming that ( < 1/3 and rj < 1/4, 



r(B ) - p(B ) 



< 3A mi „ + 4 max (B )| C r , 



(A/, n + CfCA^m,n/ (1 - a)) 



In summary, for the following cases, we expect that the estimate T(Bo) which we obtain in Step 3 improves 
upon the initial estimate in Step 1. Let CJ = A mjn (l + o(l)) + max^j \p^ (B ) 1 1 Clearly the RHS of (39) 
is bounded by 2(7(1 - C)- 

1. For all i / j, /jy(Bo) is bounded in magnitudes; For example, when |py(Bo)| = 0(\f f/m), then 
(' x A m>n . In particular, 



Vi ^ j, 



rVj(Bo) - Pij(B Q ) 



< 2C_ ^ 2A m , w (l + o(l)) < 

- 1 - c ~ 1 - c ~ r ' 



if Bo is a diagonal matrix. Hence the error in estimating Aq is propagated into the estimate of pij(Bo) 
only when pij(Bo) / 0. 

2. When m and / are close to each other in that the ratio m/f — > const > 1, and simultaneously, C m , 
Cj, and \pij(Bo)\ are small for all i ^ j; Then (' x A m>n + A/, n provides a tight bound. 

A refined analysis on the GLasso given the estimates in Theorem 6.6 is left as future work. 
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7 Numerical results 



We demonstrate the effectiveness of the Gemini method as well as the Non-iterative penalized Flip-Flop 
method, which we refer to as the FF method, with simulated data. For a penalty parameter A > 0, the 
GLasso estimator is given by 

glasso(r, A) = argmin(tr(r0) — log |9| + A (0^ off ), 
e 

where T is a sample correlation matrix. We use the R-package glasso Friedman et al. (2008) to compute 
the GLasso solution. For the two estimation methods we have various tuning parameters, namely A, v for the 
baseline Gemini estimators, and <fi, v for the FF method. In our simulation study, we look at three different 
models from which A and B will be chosen. Let Q = A^ 1 = (uiij) and II = B~ l = (vr^). Let E denote 
edges in Q, and F denote edges in II. We choose A from one of these two models: 

• AR(1) model. In this model the covariance matrix is of the form A = {/o' i_ - 7 '}ij. The graph corre- 
sponding to is a chain. 

• Star-Block model. In this model the covariance matrix is block-diagonal with equal-sized blocks 
whose inverses correspond to star structured graphs, where An = 1, for all i. We have 20 subgraphs, 
where in each subgraph, 8 nodes are connected to a central hub node with no other connections. The 
rest of the nodes in the graph are singletons. Covariance matrix for each block S in A is generated as 

in Ravikumar et al. (2008): Sij = p = 0.5 if (i, j) 6 E and Sij = p 2 otherwise. 

For IT, we use the random concentration matrix model in Zhou et al. (2008). The graph is generated accord- 
ing to a type of Erdos-Renyi random graph model. Initially we set II = 0.25i fxf> where / = 80. Then, we 
randomly select d edges and update II as follows: for each new edge (i, j), a weight w > is chosen uni- 
formly at random from [w m - m ,w max \ where u> max > ^min > 0; we subtract w from ir^ and TTji, and increase 
Ira and -Kjj by w. This keeps II positive definite. For both models of A, we have A* = = A = p(A). 

Let SI* = t -^^-VL and II* = j^njyll. Thus we have J7* = and II* = II for all combinations of A and B in 
this section. 

7.1 Regularization Paths and Cross-validation 

We illustrate the behaviors of the Gemini estimators for each model combination of A, B with m = 400 and 
/ = 80 over the full regularization paths. To evaluate consistency, we use relative errors in the operator and 
the Frobenius norm. For model selection consistency, we use false positive and false negative rates defined 
in Table 1. For each pair of covariance matrices, we do the following. First, we generate A and B, where A 
ism x m and B is / x /. Let A 1 / 2 and B 1 / 2 be the unique square root of matrix A and B respectively. Let 
T and T' be a set of values in (0, 0.5]. Now, repeat the following steps 100 times: 

1. Sample random matrices . . . , X^i.i.d. ~ A// )TO (0, A® B): 

x (t) = B 1 / 2 Z(t)A 1/2 , where Z i5 {t) ~ JV(0, 1) Vi, j,Vt = 1, ... ,n. 

Compute the sample column correlation corf co i as in (6a) and row correlation coff row as in (6b). 

2. For each A E T and v G T': 
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(a) Obtain the estimated inverse correlation matrices A 1 , and B 1 with glasso(corf co \ , X) and glasso(coff row , v) 
respectively. Let 0(A) := A~ x and 11(f) := B~ l , where A* and B* are as defined in (17). 

(b) Let E(X) denote the set of edges in the estimated 0(A). Now compute FNR(A) and FPR(A) 
as defined in Table 1. To obtain FNR(^) and FPR(^), we need to replace E(X) with F(v), 
which denotes the set of edges in H(v), E with F, and m with /. Compute the relative errors 
||0(A) — 0||/ ||0|| and — H 1 1 / ||II||, where ||-|| denotes either the operator or the Frobenius 
norm. 



Table 1: Metrics for evaluating E(X) 



Metric 


Definition 


False Positives (FPs) 
False Negatives (FNs) 
True positives (TPs) 
True Negatives (TNs) 
False Positive Rate (FPR) 
False Negative Rate (FNR) 


# of incorrectly selected edges in E(X): \E(X) \ E\ 

# of edges in E that are not selected in E(X): \E \ E(X) 

# of correctly selected edges: \E(X) n E\ 

# of zeros in E(X) that are also zero in E 
FPR = FP/(FP + TN) = FP/((™) - \E\) 
FNR = FN /(TP + FN) = FN/\E\ 



After 100 trials, we plot each of the following as A changes over a range of values in T: (FNR + FPR) (A) 
for E(X), where FNR, FPR are averaged over the 100 trials for each metric, and the average relative errors 
in the operator and the Frobenius norm. Similarly, we plot these as v changes over a range of values in T . 
Figure 2 shows how these three metrics change as the l\ regularization parameters A and v increase over 
full paths where covariance A comes from either AR(1) or the Star-Block model, and II comes from the 
random graph model. These plots show that the Gemini method is able to select the correct structures as 
well as achieving low relative errors in the operator and the Frobenius norm when A and v are chosen from 
a suitable range. In addition, as n increases, we see performance gains over almost the entire paths for all 
three metrics as expected. Other model combinations of A, B which are not shown here confirm similar 
findings. 

In Figure 2, we also illustrate choosing the penalty parameters A and v by 10-fold cross-validation. To 
do so, we run the following for 10 trials. In each trial, we partition the rows of each X^\t = 1, . . . , n 
into 10 folds. For each fold, the validation set consists of the subset of rows of X^\...,X^ sharing the 
same indices and its complement set serves as the training data. Denote by corr> and corf y the column- wise 
sample correlations based upon the training and the validation data, which are computed in the same manner 
as in (6a). We define scores (A) = tr(©ACoff y) — log \@\\, where Q\ = glasso(cor?y, A). The final score 
for a particular A is the average over 10 trials (with 10 folds in each trial) and the one with the lowest score 
is chosen to be Xcv- Similarly, we use column partitions to obtain vcv- 

7.2 ROC comparisons 

In this section, we compare the performances of the two methods, namely, the baseline Gemini and its three- 
step FF variant over the full paths by examining their ROC curves. Each curve is an average over 50 trials. 
We fix / = 80, m = 400, n = 1. To simplify our notation, we summarize the penalty parameters which we 
use for indexing the ROC curves as follows: 

A = Xb v = Xa 4> = X Bl v = Aa x 
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A: ARID: FPR+FNR A: AR(1): Relative error in L2 A:AR(1): Relative error in Frobenius Norm 




0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5 



A: ARID: FPR+FNR A: AR(1): Relative error in L2 A:AR(1): Relative error in Frobenius Norm 




0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 



A: Star-Block; FPR+FNR A: Star-Block; Relative error in L2 A: Star-Block; Relative error in Frobenius Norm 




0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 



Figure 2: m = 400, / = 80, n = 1. B^ 1 follows random graph model with d = 80 and w £ [0.1, 0.3] 
throughout these plots. In the top two panels, covariance A follows the AR(1) model with p = 0.5; for the 
bottom two panels, A follows the Star-Block model. The top and the third panel are for O(A); the second 
and the bottom panel are for n(zv). As penalization parameter A or v increases, FPs decrease and FNs 
increase; therefore, we see the bowl-shaped curves in all plots on the left column. The relative errors also 
first decrease and then increase before they level off. This happens because at first, decreased FPs clearly 
help to reduce the estimation errors; however, as penalization further increases, the estimated graphs are 
missing more and more edges until the inverse covariance estimates have only diagonal entries in the end. 
Solid and dashed horizontal lines show the performances of Gemini for cross- validated tuning parameters: 
in the top two panels, Xcv = 0.16 and vqv = 0-08 for n = 1, and X' cv = 0.08 and v' cv = 0.04 for n = 3 
respectively. For the bottom two panels, Xcv = 0.16 and vqv = 0.10 for n = 1, and X' cv = 0.06 and 
v 'cv = 0-03 for n = 3 respectively. The cross-validated tuning parameters tend to stay close to the point of 
A or v which minimizes the relative error in the Frobenius norm. 
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A:p=0.5 B:d = 180,w:[0.1.0.3] 



A:p=0.5 B:d = 180,w:[0.1 ,0.3] 



ROC comparison for A 




Gemini 
FF v, 
FFv-> 
FF v 3 



ooo o55 oTo 0^5 020 025 o5o 035 ooo 005 0T0 ols o56 0^5 o5o 035 ooo o5s oTio 0TY5 020 025 o56 035 

FPR FPR FPR 



p=0.5;d=180 l 
p=0.5;d=180 l 
p=0.5; d= 90 v 
p=0.5; d= 90 v 



[0.6. 0.8] 
[0.1. 0.3] 
[0.6, 0.8] 
[0.1, 0.3] 



A:p=0.5 B:d = 180,w:[0.6,0.8 



A:p=0.5 B:d = 180,w:[0.6,0. 




Gemini 
FF v, 
FF v 2 
FF v 3 



0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 
FPR 




0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 

FPR FPR 



A:p=0.5 B:d = 90,w:[0. 1,0.3] 



Gemini 
FF v, 
FFv 2 
FF v 3 



OOO O05 oTo 0/15 O20 ol?5 O30 035 
FPR 



A:p=0.5B:d = 90,w:[0.1,0.3] 



Gemini 
FF .(M 
FF<|> 2 
FFo 3 



)0 O05 0T0 015 O20 025 O30 035 



ROC comparison for A 



p=0.5;d=180\ 
p=0.5;d=180\ 

p=0.7; d=l80\ 
p=0.7;d=180> 



[0.6. 0.8] 
[0.1. 0.3] 
[0.6. 0.8] 
[0.1, 0.3] 



OOO O05 O10 0/15 O20 025 O30 035 



A:p=0.5 B:d = 90,w:[0.6,0.8 



A:p=0.5B:d = 90,w:[0.6,0 




Gemini 
FF v, 
FF v 2 
FF v 3 



0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 
FPR 




0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.00 0.05 0.10 0.15 0.20 025 0.30 0.35 

FPR FPR 



Figure 3: m = 400, / = 80, n = 1. Solid lines are for Gemini. Plots in the left column are for A 
and the middle column are for B. The three dotted lines in each plot on the left column correspond to 
the three optimization criteria v\ y v% as specified in Step 3. For the middle column, they correspond to 
(ii, <pi), (^2, $2)1 (hi <p3)>&$ specified in Step 4. In the right column: in top two plots, we choose A from 
AR(1) model with p = 0.5 while changing the settings of B^ 1 as in Table 2; In bottom two plots, we choose 
A from AR(1) model with p = 0.5 or 0.7 while changing the settings of B~ l with d = 180. Dotted lines 
in plots for A on the right column are chosen according to the optimization criterion ui, and in plots for B, 
they are chosen according to the criterion (pi. 
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To illustrate the overall performances of the baseline Gemini method for estimating the graphs of 0, and 
n, we use pairs of metrics (FPR(A), 1 - FNR(A)) and (FPR(v), 1 - FNR(z^)) respectively, which we 
obtain as the average over 50 trials of Step 1 and 2 as described in Section 7.1. To plot the ROC curves for 
the FF method, we start with estimating II with the Gemini estimator. Due to computational complexity, 
we specify the input parameters of the subsequent steps sequentially. These choices are not feasible in 
practical settings. We run through this idealized example for the sake of comparing with the baseline Gemini 
estimators. Repeat the following 50 times: Let T := {0.02, 0.04, . . . , 0.72}. 

1-2 Run Step 1, 2 as in Section 7.1, while only computing the metrics for II(zv), where v G T. 

3 To execute the second step of the FF algorithm, we use the following three outputs from Step 2 of the 
current procedure to act as B\ to compute A(B\). We choose the output B\ such that its corresponding 
v is chosen to be v\ = argmin t , g r(FNR + FPR)(V), v% = argmin^gT ||II(z/) — ZI 1 1 2/ ||n|| 2 , and 
1/3 = argmin^gr ||II(V) — n||^/ ||II|| ^. Denote these by B\, B\ and B\. We now run the second step 
of the FF method for each B\, where i = 1,2, 3, with penalty parameter cj> G T changing over the full 
path while obtaining the inverse estimators Q l (4>) for Q, and computing FNR* ((/>), and FPR J (0) for 
each estimated edge set. These contribute to 3 ROC curves for estimating the edges in E. 

4 To execute the last step of the FF method, we use the following three outputs from Step 3 as A\ to com- 
pute B{A\). We choose the output A\ such that its corresponding [i, 0) is chosen to be optimal with 
respect to one of the following metrics: 4>\) = argmin^ g T i j =1 .2,3(FNR l + FPR l )(0), (i 2 , 4>2) = 

argmin0 eTii= i i2 ,3 \\^{4>) ~ ^h/ ||^|| 2 , and (*3,^3) = axgmin^ 6 T ii= i j2 ,3 ||^ l (0) - ^||f/||^Hf- 
The choices then become (v , i j , 4>j),j = 1, 2, 3, which we simply denote by <pi, 2 , <fe- Thus, there are 

again three choices for A\. We now run the third step of the FF method for each B{A\) with v G T 
changing over the full path, while computing FNR J (t>) and FPR J (t>), where j = 1, . . . , 3, for each 
estimated edge set. These contribute to 3 ROC curves for estimating the edges of F. 

The ROC curves are plotted in Figure 3 using pairs of metrics (FPR* (0) , 1 — FNR* ((f)) ) and (FPR J (v),l — 

FNR 3 (v)), i,j = 1,2,3, which are averaged over 50 trials. Throughout the plots on the left column of 
Figure 3, we see clear performance gains of the FF method over the baseline Gemini on estimating SI = A -1 , 
when the initial penalty v is chosen properly. For II = B v in the middle column, we do not always see 
improvements when w is drawn from [0.6, 0.8]. We do see some improvements in case w is drawn from 
[0.1, 0.3] and when the total correlation p 2 B (see definition below) is small. Overall, the performance gains 
for n are not as substantial as those for Q. These observations are consistent with our theory and discussion 
in Section 6.1. 

7.3 Summary 

We use the following metrics to compare matrix B and A across different models or parameters: 

1. Total correlation: p\ = £ 4<J p%{A)/(™) and p% = Z t<j 

2. \\B\\ F /tr(B) and \\A\\ F /ti(A): these affect the entry-wise error bound in sample correlation esti- 
mates for pij(A) and pij(B), for all i ^ j, for the baseline Gemini estimators. 

3. The pairs of ^-metrics (|p(^) _1 |i i0ff , \p{B)-\) and (\p(A)-\ oS , \p(A)~ 1 \ 1 ). 
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The total correlation metric comes from Efron (2009). We use it to characterize the average squared mag- 
nitudes for correlation coefficients of p(A) or p(B). They are clearly relevant for the FF method as the 
entry-wise error bound for estimating Pij(A) and pij(B), for all i / j, depends on the magnitude of the 
entry itself (cf. Theorem 6.2 and 6.6). We summarize the metrics for B in table 2. 



Table 2: Metrics for comparing the ROC curves 



Metric 


d=90, w: [0.1, 0.3] 


d=180, w:[0.1,0.3] 


d=90, w: [0.6,0.8] 


d=180, w:[0.6,0.8] 


Pi 


0.053 


0.06 


0.094 


0.12 


\\B\\ F MB) 


0.128 


0.13 


0.155 


0.16 


l\ -metrics 


(55, 152) 


(71, 166) 


(99, 225) 


(102, 216) 



We summarize our findings across the ROC curves in the right column in Figure 3. First we focus on the case 
when A is fixed and B is changing. When II follows the random graph model, we observe that for both the 
baseline Gemini estimators and their FF variants, the performances in terms of estimating edges for Q, are 
better when the weights for II are chosen from [0.1, 0.3] for both d = 90 and d = 180. Here the sparsity for 
II is not the decisive factor. This is consistent with our theory, in view of Table 2, that \\B\\ F /tr(f?) affects 
the entry- wise error bound for the baseline Gemini correlation estimate T(A) as shown in Corollary 2.2, 
and the pair of metrics (|p(_B) _1 L off , ^(i?) -1 ^) affect that for the FF correspondent in (26) as shown 
in Theorem 6.2. The performances in terms of edge recovery for II take a different order. The sparse 
random graphs with d = 90 see better performances than those with d = 180 for both the Gemini and 
the FF methods. For graphs with the same sparsity, the one with the larger weight performs better. This is 
consistent with our theory in Section D.2. 

Next we choose two covariance matrices for both A and B: for B, we choose the two cases with different 
edge weights with d = 180; and for A, we set the parameter p to 0.5 or 0.7. The metrics for the two choices 
of A are: for p = 0.5, we have p\ = 0.04, \\A\\ F /tx{A) = 0.065, and ^-metrics = (532, 1198). The 
corresponding numbers for p = 0.7 are: 0.07, 0.085, and (1095, 2262) respectively. 

First we note that the two cases of B show the same trend when A is fixed. In the right bottom two plots 
in Figure 3, for the graphs of Q, we find it easier to estimate when their covariance matrices come with 
parameter p = 0.7, which results in larger l\ metrics, and hence larger weights on the inverse chain graph; 
for the graphs of II, we observe relatively larger performance gains when p = 0.5 for A, with the most 
significant occurring when w G [0.1,0.3] for II, where both p(A)^ 1 and p(B)^ 1 have smaller i\ metrics 
and the total correlation p 2 B = 0.06 is also small. The least improvement we see occurs in case all three 
metrics are large: p = 0.7, w G [0.6,0.8], and p\ = 0.12. These findings are consistent with results in 
Theorem 6.2 and 6.6, where we explicitly show the influence of the pairs of ^i-metrics on the error bounds 
for the FF sample correlation estimates. 

8 Conclusion 

In this paper, we presented two methods for estimating the graphs in a matrix variate normal model. The 
baseline Gemini method is rather simple and provides the same rates of convergence as the Non-iterative 
Penalized Flip-Flop method in the operator and the Frobenius norm. Under sparsity conditions as detailed 
in (Al) and (A2), the NiPFF method improves upon the baseline algorithm in estimating Aq 1 , which is 
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assumed to be the one with the larger dimension, so long as p(Bq) 1 satisfies a certain additional sparsity 
condition, namely, its l\ metrics are bounded in the order of its dimensionality. 

It remains an open question whether the iterative methods will help achieve better convergence rates in 
the one-sample or finite-sample instances. We leave the answer to this question in future work; in the 
current work, we illustrate that under suitable conditions, our three-step approximation could indeed improve 
upon the baseline algorithm. However, we show in both theoretical analysis and simulation results that the 
performance gains for estimating Bq 1 using the NiPFF method at the third step are rather limited; hence we 
do not advocate iterating beyond the first three steps. 

Although our primary interests are in estimating correlations and partial correlations among and between 
both rows and columns when X follows a matrix variate normal distribution, our methods clearly can be 
extended to the general cases when the data matrix X follows other type of matrix-variate distributions. 
Theoretical analysis on such generalized models will be left as future work. Finally, Hoff (2011) showed 
that the class of separable covariance models for random arrays of arbitrary dimension can be generated 
with a type of multilinear transformation of an array of independent, standard normal entries. It will be 
interesting to apply our methods to the array normal setting. 
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A Estimation of the covariance matrices 



Theorem A. 1 and A.2 show the rates of convergence in the operator and the Frobenius norm for estimating 
both Aq ® Bo and its inverse, which will depend on n through Xa and A b , which in turn depend on the 
rates of convergence in entry-wise max norm in estimating p{Aq) and p(£>o) respectively. 
Theorem A.l. Let A Xb := minjA^p, \b }- Suppose (Al) and (A2) hold. Let us define for c = 
log \og(mvf) > d = l — (c/2 A 1/2). Suppose that Xa ,^b < 1. andfor some < £\,e 2 < 1 

Aa := 3a n /ei and X Bq := 3/3 n /e 2 
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where a n and j3 n are as defined in Theorem 4.1. Then on event Xq, where P {Xq) > 1 — max ^ yp 
for 9 < C, C < 18, 



, we have 



A <g> B - A <g> B 



X Ao A Ago 
2 < 2 " °" 2 ll-°o|| 2 



+ 2C\ Bo a 

max 

lOoff v 1 



+ 2C'X Ao b max \\Ao\\ 2 k {p{B )Y J\Bq\ V 1 



(p(A ))^(p(B )) VUo 
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VI, 



and for 19/2 > C,C > 5, 
J4®B~ - 
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Aa A A^o I, 



\\2 II ||2 
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6CC"Aa A Bo ^/ l^o^o.ofi v 1 V l^o Ncoff v 1 
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Theorem A.2. Suppose all conditions in Theorem A.l hold. Suppose (Al) and (A2) hold. Then on event 
X , we have for 9 < C, C < 18, 



A <g> B - A ® B 



X Ao A Ago .. h 

< 2^ ||A)®-do|If 



+ 2CA Bo 

"max 



lO.off 



V m 



"max 

\\A \\ f k{ P {B ))\\B^ 



>-i| 



lO.off 



v/ 



+ 7CC AAoAsoamaxfrmaxft 

(p(^o)) 2 « (p(Bo)) 2 ^|^o 1 | ,ofr Vm Vl^ 1 |o,off V/ ' 

and/or 19/2 > C,C" > 5, 



-i 



A® B - Aq 1 ®B 



< 



Xa A Ab 



Ur 1 



Bn 



o iif II o Hf 



+ AboH^o 1 ) 



2CJ A 



l 



0,off 



V m 



«min^min(KA))) 



+ +Aa ||A) IIf 



1M 2C '^\ B o\ oS Vf 



&minVmin(P(Bo)) 



+ 



28CC"A j 4 Ab J A 1 



0,ofT 



V m 



0,off 



v/ 



50min"min 



We show an outline for proving Theorems A. 1 and A.2 in Section E, with actual proof in Section F. 1 and F.2. 

Corollary A. 3 states the rates of convergence in the operator and the Frobenius norm for estimating A* and 
B*, where recall that 

fl* max — ITlciXj (I* a and 

&*,min — milli CI*, 22- Denote by fr*,max — IELcIX^ h*^a and 

^*,min — niin^ 6* 22* 
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Corollary A. 3. Let A* , B* be as defined in (16) and Let A* , B* be as defined in (17). Suppose Assumptions 
(Al ) and (A2) hold. We have for some absolute constants 9 < C, C < 18, on event Xq, 



A* 


A* 


< 
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B* 


— Bjf 


< 

2 



< 4^CX Bo a^ x K (p(A )) 2 ^/l^o'lcoff v 



1. 



< 2C'AA 6*,max'« (p(-Bo)) \ \ B 



lO.off 



VI, 



For the inverses, we have for 5.25 < C,C < 19/2, on event Xq, 

^ ACX B y/\A^\ oB V l/(a*,minVmin(^(^0))) 

< 2C'A Ao ^| J B - 1 | 0i0ff V l/(6,, min ^ in (p(i?o))) 
We have the following rate of convergence for the Frobenius norm: for 9 < C, C < 18, on event Xq, 
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A.l Some convenient bounds 

Throughout our proofs, we need the following bounds: 
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A.2 Proof of Theorem 3.1 



Theorem 3.1 is a simple corollary of Theorem A. 1 . We now insert the bounds as in (42e) in Theorem A. 1 to 



obtain 
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For the inverse, we plug in the bounds as in (42c) and (4 2d) in Theorem A. 1 to obtain 
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A.3 Proof of Theorem 3.2 



First we state the following bounds. 
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We now insert the lower bounds as in (43c) and (43d) in Theorem A.2 to obtain 
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Similarly, plug in the lower bounds as in (43a) and (43b) in Theorem A.2 to obtain under (Al) and (A2), 
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To see the last equality: 



I j4 "H V?71 / I -A x I V?T1 

1. If \Aq\ oS < 777, then AboV " t° g = A b ; otherwise, A Bo V ° = o f 1 



2. If I^q- 1 !^ < /, then A Ao y |i?0 ^ offV/ = A Ao ; otherwise 
Moreover, the expression for 5 can be simplified as follows: 

1. If | A 1 | Q off < 777 or X 777, and I-Bq" 1 | off < / or x /, then 5 x A Bo + A^ . 
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2. If 1 < n < log(m V /), then 

5 = 



Of — + — + A Ao + Xb 



1 



1 

log d max(m, /) 



1 1 



Aa + ^B 



*nf y/nm / 

where d = 1 for re = 1, and d > 1/2 for n < log(m V /). 
The upper bounds on 5' are dominated by that of 5 under each case; hence the same conclusions hold. □ 



B Concentration inequalities for the Gaussian Chaos 



In order to prove Lemma 4.2, and Theorem C.l, we need the following large deviation bound on the sum 
of i.i.d. random variables which are Gaussian chaos of order 2. The proof of Theorem B.l appears in 
Section B.l. 

Theorem B.l. Let (gi, ... ,g m ) be a random vector with m independent N(0, 1) variables and Y = 
YT=iT^=i A ij9i9j-Let 



Z :-- 



Y-ElY] 1 / m m m ' 
7=T^ =g EE 9k9jA kj + J2(9k ~ VAkk 



and Z\, . . . , Z n be independent copies of Z. Then Bernstein 's inequality yields 

^ Zi > t I < 2 exp ( - 

n 



1 n 



= 1 



2vi + 2Wt J ' 



and an improved bound based on the Bennet's inequality yields the following: 

_2 
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> r < 2 exp 



nr 



Vi+Wt + Vi j 1 + 2 W- t 



where v 1 = < 00 and W ~ 



(44) 



£ < oo with C x = -#= w 5.0088 a«J C 2 = V8e 7.6885. 



B.l Proof of Theorem B.l 

This proof follows exactly the sequence of arguments in Zhou et al. (2009). Hence we only sketch it here 
for completeness. Applying a general bound of Ledoux and Talagrand (1991) for Gaussian chaos gives that 



E[|Z| r ] < (r- l) r (E \Z\ 2 ) 



vr/2 



(45) 



for all r > 2. This bound guaranties that factorial-type moment estimate for Gaussian chaos, which makes 
application of Bennet's inequality possible. We first state the following claim derived from (45), whose 
proof appears in Rauhut et al. (2008), which we omit. 
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Claim B.2. (Rauhut et al. (2008)) Let W = e(E [\Z\ 2 ] f/ 2 and s = ^=E \Z\ 

Vr > 2, E[|Z| r ] < r\W r - 2 s/2. 

Clearly the above claim holds for r = 2, since trivially E [\Z\ r ] < r\W r ~ 2 s/2 given that for r = 2 

2e 



r\W r ~ 2 s/2 = 2W' A ~' A sj2 = s 



2-2, 



Qir 
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where 
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Set 
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A\\ F < oo 
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where the last inequality holds by assumption, and for all i, 



2e 



=E 
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4e 1 „ , ll2 2.5044 
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6-7T "I 



\A\\ F < oo. 



Thus for independent random variables Zi, Vi = 1, . . . , n, we have 

E \Z\\ < r\W r '- 2 Vi/2. 

We now apply the following Theorem B.3, the proof of which follows arguments from expression (1) to (7) 
in Bennett (1962), where one needs to replace a 2 with v and a 2 with Vi everywhere in between (1) to (7) 
while noticing that all arguments will necessarily go through despite these substitutions. 
Theorem B.3. (Bernstein's and Bennet's inequalities Bennett (1962)) Let Z\,...,Z n be independent 
random variables with zero mean such that 

n\ZiU <r\W r - 2 Vi/2, 
for every r > 2 and some constant W and Vi,\/i = 1, . . . , n. Then for r > 0, 

_2 



i=i 



> t < 2 exp 



2v + 2Wt 



with v = Y^i=i v i> An improved bound is also given by 



i=l 



> r < 2 exp 



v + Wt + Vv 2 + 2Wvt 



(46) 



Remark B.4. The statement in Theorem B.3 corresponds to Expression (7) in Bennett ( 1962), once we re- 
place a 2 byv = 2~27=1 Vi > which i s typically known as the Bernstein 's inequality; Similarly, (46) corresponds 
to Expression (7a) in Bennett (1962). 
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We can then apply the Bernstein's Inequality to obtain the following: 



1 n 
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> nr 
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with Ci = « 5.0088 and C2 = V8e ~ 7.6885 as desired. Similarly, we apply the Bennet's inequality 



in (46) to obtain (44). □ 



B.2 Proof of Theorem 4.1 

Throughout this proof, we assume that event Xq as defined in Lemma 4.2 holds. First we obtain the large 
deviation bounds on the estimated correlation coefficients. We prove it only for Y{Bq), as a similar argument 
also works for T{Aq). On Xq, we have 



Vi, 



and hence 
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The last inequality in the theorem statement follows immediately by summing over the large deviation 
inequalities on the £| norms for the row vectors and column vectors respectively. In more details, we have 



1 ELi - tr(4))tr(B 



n 2-jt=l 2-ui=l ll x 



tr(A )tr(B 
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< min {tr(^ )/7o, tr(B )mT Q } = (a n A /^)tr(i4o)tr(B ). 
The theorem is thus proved. □ 
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B.3 Proof of Lemma 4.2 

The conclusions of Lemma 4.2 follow from Theorem C.l. In proving Theorem C.l, we will focus on the 
case when n > log(m V /). Hence we focus on the case when n < log(m V /) in the proof of Lemma 4.2. 

Let us first define a set of mean-zero random variables Za, Zb, and Z(i, j £ {1, . . . ,m},i 7^ j. For 
two random variables Y, Z,Y ~ Z means that they follow the same distribution. 

Proposition B.5. Let g\,Q2, • • • ~ N(0, 1). We have for row vectors y 1 , . . . ,yf and column vectors 



, x m of X ~ A// m (0, Aq <g) Bq), where Aq = (a^) and Bq = (fey), 
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be the unique symmetric square root of the positive definite matrix Bqujx, where Bq^^ 



is the submatrix of Bq with rows and columns indexed by {i, j}. Define 
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We assume that max(m, /) > 2. We first apply the Bennett's Inequality as stated in (44) in Theorem B.l 

£L\ og d-c/2( m v where 



rriT 



20 



II || 



with n = (logmax(m, /)) c , where 1 > c > 0, and e - 
d = 1 — c/2 > c/2, to obtain a concentration bound for Y^t=i ^t, where Z%, . . . , Z n are independent copies 
of as defined in (48). Assuming that max(m, /) > 2, and hence log d ~ c / 2 (m V /) > 1, we have 
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Then Z\(i, j) , . . . , Z n (i, j) are independent copies of Z(i,j) as defined in (49). We next apply the Bennett's 
Inequality as stated in Theorem B.l with n = (log max(m, /)) c , where < c < 1, and e = -y/m/2ro 

1 ||-Ao||f (log max(rr 
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where v\ = 2.5044^ ||^4o||^» an d W = ^= ||>lo||^.. Following similar arguments, we get, assuming that 

max(m, /) > 2, 
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> t i \ < m{m- 1) 

"' ~~ max(m, /) 4 



The lemma is thus proved by summing up the probability for all bad events and using the fact that for 

(m V /) > 2, 

m(m - 1) + 2m + /(/ - 1) + 2/ 
max(m, /) 4 
^ 2 max(m, /) 2 + 2(m V /) ^ 3max(m,/) 2 



max(m, /) 4 



max(m, /) z 



□ 



It remains to prove Proposition B.5. 

Proof of Proposition B.5. First observe that when i = j, 



I o\\ 2 



x xj = (gi, . . . , g m )b jj A (g 1 , g m ) T = bjj ^ ^ a ki g k g e , 



k=l k=l 1=1 

where gi, . . . , g m ~ N(0, 1) and the mean-zero normal random vector yi ~ A/" m (0, bjjAo) can be expressed 



as: 



yi = (x^ =1 = (A b 1J ) 1 / 2 (g 1 ,...,g m ) T . 



3 

Statements in (48) and (47) follow immediately. The rest of the proof follows exactly that of Theorem C.l, 
in the decorrelation step, for M = I, after we write (y l ,y 2 ) = zCfcLi x\kx\ = ^ZfcLi YllLi ^kl x \ x x \- 

□ 



C A general theory on concentration inequalities 



Suppose that we have n i.i.d. random matrices X(l), X(2), . . . , X{n) ~ A// m (0, Aq <g) Bq), where Ao 
(cijk) and Bq = (bjk) are positive definite. Then 



££*=iv(*) £ ®v(*)* 



S e n k where E 
S^ fc where E 



x(^<g>a;(t) fe 

y{tf®y(tf 



nek Bo, 
bekAo, 



where x(t) 1 , . . . , x(t) m G are column vectors, and y(t) 1 , . . . , y{ty G R m are row vectors of matrix 



, m 



X(t). We now provide a large deviation inequality for the weighted sum of submatrices S„ , i, k = 1 
(or sum ofS*!°,e,k = l,...,f), where the weights correspond to entries in an m x m matrix M (or in an 
/ x / matrix N). More precisely, we prove Theorem C.l. 
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Theorem C.l. Let mV/>2. Let matrices M mxm and Nf X f satisfy the following condition: 



A ||M||i < oo and \ \\N\\ 2 F < x . 



(51) 



Denote by D 
Then, we have for d 



A^MA^ 2 



I \fm and D' 



1 - (c/2 A 1/2), where c 



bI /2 nbI /2 

log n 



/yff. Suppose that n < log(m V /). 



m 



log log(mV/)' 

diag( J B )- 1/2 (E£=i ET=l M kiS e n k ) diag^o)- 1 / 2 - tr(A M)p(B ) 
log d max(/, m) 



< 20 D- 



imn 



diag(^ c 



-1/2 



£*Li TZi N uSn diag(A) 



-1/2 



tr(5 iV)^(Ac 



< 20D 



,log d max(/, m) 



Otherwise, let n > A(C\k 2 + C2K) logmax(m, /), where k > 0. 77ze?i, w/f/z probability 1 — max (^ jp- , 

above inequalities hold with d = 1/2 a«<i f/ze constant 20 ^e/rcg replaced with 2-sJ C\ + C^fn, where 
Ci = -#= rj 5.0088, and C 2 = \/8e rj 7.6885. 

V 07T 

Remark C.2. i. Theorem C.l is stated as a matrix max norm bound for convenience; when we look at 
the individual entry -wise bound, we will come to the same conclusions as in Lemma 4.2. Clearly, 
Theorem 4.1 and Corollary 2.2 are special cases of Theorem C.l when we set M = I. 

Ill 1 1 2 1 1 2 

2. Suppose the operator norm of A I is finite, then (51) holds given — ||M||_p < ||M || 2 . When we put a 
stronger (in terms of constants) lower bound on n by increasing k, then C2 disappears from the RHS 
of the inequalities, and the leading constants become ~ 4.472, which is not optimized in this proof. 

For the special case when M = A^ 1 and N = B^ 1 , we have the following corollary of Theorem C.l. 
Corollary C.3. Suppose thatn < log(mV/). Then, we have for d = 1 — (c/2 A 1/2) where c = log lo °^ v j) » 



^diag( J Bo)- 1/2 S(^o)diag( J B )- 1/2 - p(B ) 
}diag(^ )- 1/2 ^(So)diag(A )- 1 /2 _ p(j4o 



< 20 



log rf max(/, m) 



'ran 



< 20 



log rf max(m, /) 



Otherwise, let n > 4(C\K 2 + C2K) logmax(m, /), where k > 0. Then, with probability 1 — max (^ 

the above inequalities hold with d = 1/2 and the constant 20 being replaced with 2y/ C\ + Qz/k, where 
d = -|f= rj 5.0088, and C 2 = V8e « 7.6885. 



/ 6"7T 



C.1 Proof of Theorem C.l 

Our analysis will focus on estimating Bo and p(Bo) when n > log(m V /), which applies to Aq and 
p(Aq) with minor changes on notation. In case n < log(m V /), the proof is similar to that of Lemma 4.2 
in the union bound part, while for decorrelation, it follows the same arguments as below. In particular, all 
concentration bounds on random variables Za,Zb, and Zy,Vij as defined and bounded in the proof of 
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Lemma 4.2, will apply to the corresponding ones in the current proof after we replace || A)|| 2 with D and 
|| So || 2 with D' everywhere. 

Decorrelation. The quantity which we are interested in is the following weighted sum of random matrices: 

mm 

Y J Y,(^nM k e-a ek M ke B " 



k=l 1=1 

1 v^n 



E?=i (EZLi ET=i M(k, l)x(tY ® x(tf - (A ,M)B ). 
Let us focus on a particular entry fyj in matrix So and rewrite the sum above as follows, 

m m 



fc=i t=\ 

I Et=i (5Xi ET=i M ke x(t)f x x(t)* - ( A), M ) 6y 

where each summand ££i M kl x(t)\ x x(t)J - ( M ) 6ij is a Gaussian chaos of order 2. We 

explore the concentration of the sum on the RHS in the next subsection. We first need to decorrelate the 
vectors in the sum. First observe that when i = j, the Gaussian random vector (a;(l)^)^ =1 involved in the 
sum is of size m, with covariance matrix being bjjAo. Without loss of generality, we write (x(l)^)^L 1 = 
(Aobjj) 1 / 2 (gi, . . . , g m ) T , where gi, . . . , g m ~ N(0, 1) and replace the sum with the following expression: 

m m 



^M M x(l)j x x(l)J = (g h . . .,g m )b jj Al /2 MAl /2 {g l , ...,g r , 

k=l 1=1 

m m 

= Y.Y. h n M 'u9k9i where M'=iJ /2 MiJ /2 

k=l 1=1 



We next fix i = 1, j = 2. Then for vectors (x(l)f) , —1 and (^(l)!)^^, we concatenate them to become a 
Gaussian random vector of size 2m with covariance matrix 



Cov 



(x(i)[)T =1 , Ki)tmi) = ( In III )® A ° := A °- 



where i?o,{i,2} * s ^ e submatrix of So with rows and columns indexed by {1, 2}. Let I 11 %3 ) be the 

unique symmetric square root of the positive definite matrix Bqu^x. Thus for gi, ... , g2 m ~ N(0, 1), we 
write for the concatenated vector 



{{x(l)[)T =l ,{x{l) k 2 )t =1 ) = Bl{ 2 m ®Ay 2 { gi ,...,g 2 

{gi, ■■■,92 

and ££M M x(l)lxx(l)§ = (x(l)\ . . . ,x(l)?)M(x(l)l . . . ,x(l)?f 



,1/2 ,1/2 \ 

A 1 / 2 A 1 / 2 



k=l t=l 



2m 2m 

i=i j=i 
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where 



M' = 



011012 011022 ^ « A 1 ' 2 MA 1 ' 2 
C12C12 C12C22 ' 



and E 



2m 2m 

i=i j=i 



tr(M') = 6 12 tr(A)M). 



We have ||M'|£ = (c? 2 + + c i 2 ) 



6116: 



22 



Applying the union bound. Let g^, k = 1, 2, . . . , be independent standard Gaussian random variables. For 
j = 1,...J, and 

^Mu/b 33 -(a ,m) = \ y: mjj) 

v k=l l=\ t=l 
where Z\ , . . . , Z n are independent copies of mean zero random variable 

../mm \ 

Z ^ := "7= E E M ^fc2£ - (A ,M) where M' = aJ /2 MAJ /2 . 
V m \fc=l €=1 / 



Let 



M'(i,j) = 
where 

and Z(i,j) := 



CiiCij CaCjj 



aTma 1 ' 2 



\M'(i,j)\\ F = 



33 



Al /2 MAl /2 



1 (fsf. M'jiJU 



9k9i 



(A ,M] 
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Let g(t)k,k = 1,2,... ,t = 1, . . . , n be independent standard Gaussian random variables. In addition, we 
need to bound the sum for i 7^ j, i,j = 1, . . . , /, 



1 {AtlM) * 
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\fb~d> 



33 



'2m 



( n 



1 " M M x(t)\ x z(t)* 



, n 

k=i e=i \ t=i 



y/hib 



(A ,M] 



J i3 
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1 1 



n 



'2m 2m -\ 



\Jbiibjj 

6,; 



\/biibjj 



t=l 



where Z\(i, j), . . . , Z n (i, j) are independent copies of random variable defined as above. We can then 
apply the Bernstein's Inequality to obtain the following: for C\ = 5.0088 and C2 = 7.6885, and denote by 
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D 



I \fra, for all j, 



1 n 



t=i 



> e < 2 exp 



ne 



dD 2 + C 2 De 



and for all z + j, i, j = 1, . . . , /, 2 Vl = dD 2 /2, and 2VF = C 2 D/V2 



\ t=i 



> 



V2. 



< 2 exp 

< 2 exp 

< 2 exp 



72 \ 



2ui + 2We/V2J 
ne 2 /2 



dD 2 /2 + C 2 (D/V2)e/V2 
ne 2 



dD 2 + dDe J ' 

where for both inequalities, the second term in the denominator starts to dominate only when e > ^P- 
Thus we have the following by the union bound: for e = c 



A)! 2 MA)! 2 A / logmi " (/ '" l) := cW log(/Vm) 
u o p v mn V n 



k=l i=\ 



> £ 



t=l 



> £ 



\ 



> 



V2, 



< (/ 2 + /)exp - 



c 2 log(/ V m) 



Ci + C 2 <yiog(/ V m)/n 



< (/ 2 + /)exp - 



c 2 log(/ V m) 



< 



f + f 



d + d/ft J ~ max(m,/) 4 



where in the last two inequalities, we used the fact that c 2 > 4(Ci + C 2 /k) and n > (kc) 2 log max(m, /), 
and applied the Bernstein's inequality. Similarly, we have for e = cD' ^ ^^ m "> 

mm 

dmg' l / 2 (B )-= (s k n l M M - a ke M M B ) diag- 1 / 2 ^) > e' 



k=l 1=1 



< 



m + m 
max(m, f) A 



The theorem is thus proved for n > (kc) 2 logmax(m, /) > 4(k 2 Ci + C 2 k) logmax(m, /) where k > 
can be chosen to be small enough so that the lower bound reaches log max(m, /). n 
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D Poof of Theorem 4.3 and Corollary 4.4 



Let G := p(Aq)- 1 >- and <£> := p(-Bo) -1 >- 0. Let 6 = (%) and $ = (<t>ij)- Then, 

< ¥?min(p(A))) < 1 < </>max(p(A))) < +°o and hence k(p(A )) > 1, 
and < </wO(-Bo)) < 1 < ^maxO(A))) < +oo and hence k(p(B )) > 1, 

given that ££j <MKA))) = tr(p(A))) = m and £/ =1 <Pi{p(B )) = f. 
We first write our estimator for ^(Ao) -1 and $ for p(Bq)^ 1 as follows: 

9 = arg m iri{tr(ef(A ))-log|G| + A B |e| 1)Off }, (52a) 

$ = argmiri{tr($r( J B ))-log|$| +X A \^\ 1>aS }, (52b) 

where Xa,^b are non-negative regularization parameters, and T(^4o) and T(Bq) are sample correlation 
matrices. A unique minimizer exists for each objective function above; see Ravikumar et al. (2008). Then 
A := _1 and B := <3? _1 , where A and B are the unique minimizer s as defined in Theorem 4.3. Let us 
define the following shorthand notation: 

Sao = ■ hj + 0, i / j} and S Bo = ■ 4>ij / 0, % ^ j}. 

Let A Ao : = 9 - G and A Bo := 8 - $ - 

We need the following lemma. For an index set 5 and a matrix W = [wij], write Ws = (wijl((i, j) G S 1 )), 
where /(•) is an indicator function. 

Lemma D.l. Le? Go >- 0. Le? 5 = {(i, j) : Goij ^ 0, i ^ j} and S c = : Goij = 0, i / j}. 77z<?« 

/or all A G R mxm , we /zave 

|Qo + A| lj0ff - |© |i,off ^ l A ^li-|Asli (53) 
Moreover, we have on event T(Aq), for all A G R mxm , 

tr(A(f(A))-p(A))))| < <5 n ,/|A| 1>off = <5 n!/ (|A5c| 1 + |A 5 | 1 ). (54) 

Proof. WewriteQo = diag(Go)+Go,s+@o,S' c and observe that &o : s c = and hence | ©o 1 1 ff = l®o,sli = 
l o,s| ljOff - Thus 

l o + A| 1;Off = |e 5 + A 5 | 1>off + |A 5c | liOfr , hence 
l©o + A| li0fT -|e | lj0ff > |G 5 + A 5 | liOff -|e ,5| liOff + |A 5 e| liOff (55) 

> |A<Hi,off-|As| li0fr = |A<H 1 -|A 5 | 1 (56) 

where (55) follows from the triangle inequality and (56) follows from definition of S and S c . The "more- 
over" part follows from Lemma 4.5 and definition of event T(Aq). □ 

Proposition D.2 is a standard result; see Zhou et al. (2008) for its proof. 

Proposition D.2. Let B be ap x p matrix. If B >~ and B + D >~ 0, then B + vD >~ Ofor all v G [0, 1]. 
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D. 1 Proof of Theorem 4.3 



The proof follows arguments in Rothman et al. (2008), and hence omitted. □ 

Proof of Corollary 4.4. Let T(Aq) be the sample correlation matrix. Let A := Xb- Let be the 
optimal solution to (52a). Let Go := p(Aq)~ 1 . Then A = — Go minimizes G(A) defined as follows: 

G(A) := log|e |-log|Go + A|+tr(A /3 (Ao)) + tr(A(f( J 4o)-p(Ao))) 

+ A(|e + A| liOff -|e | 1)Oir ) 

= A + tr (A(f (A) - p(Ao)) + K,f (|6 + A| l off - |G | ljOff ) (57) 

> A-rf n)/ |Ase| 1 -<y n ,/|As| 1 +A n ,/|Asc| 1 -An J /|As| 1 (58) 

= A — (S n j + Xnj) \As\i + (A n j — S nj f) \&s c \i (59) 

where A = vec { A } T ( /^(l - v)(0o + vA)" 1 ® (G + vA)- 1 dv) vec { A }, and in (58), we have ap- 
plied (53) and (54). 

Clearly G(0) = and hence G(A) < G(0) = 0. Hence for X nJ := X B > 5 nJ /e, we have by (59) 

-A + (5„ i/ + A„,, / )|A 5 | 1 > (A n , / -5 ni/ )|A 5 c| 1 > (l-e)A n , / |A Sc | 1 
andhence (l-e)A n) /|A g o| 1 < -A + (5 n j + X n j) \A S \ 1 

< -A+(l + £)X nJ \A s \ 1 

Thus we have (Agc^ < i±| \A$\ 1 given that A > 0; To see this, we note that Go >- 0, and hence 
Go + vA >- for all v G [0, 1] in view of Proposition D.2, given that G = Go + A >~ as an optimal 
solution to (52a). To see the last inequality, we have 



G - G 



li0ff = l A ^li,o ff < ^|A 5!l <^^V||A s || F 

2 



o 

l.otf 

" T^\f\ A L \o,os\\ A A \\F' 

and it holds by plugging in the bound on || A^ \\ F . n 
D.2 Discussion 

To get a sense of how tightly we can bound these entries in A^, we do the following calculations. For 
e = 2/3, we have | Asc\ 1 < 5 | As\ v and hence 



Q-G 



1 oS = |Aa Ii jOS < 6 lAslj < Q^/\A \ oS \\A S \\ F 



lw 2*4,„W4))) 
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From Corollary 4.4, we have for Qq = p(Aq) 1 and for Sa 7^ 

l A ^oli,off ,„_!+£. S. 



1^0 lo.off 



27- 



0. 



Such results are useful for bounding the number of falsely selected edges and the number of falsely deleted 
edges under certain separation assumption, which states that: the minimum absolute value of 0y : minj \ 9-ij \ 
among all non-zero off-diagonal entries in ©0 is bounded away from 0. We do not pursue such refinements 
in this work. On the other hand, for the diagonal part of A^ , we have 



9 



|diag(A j4o )| 1 < y/m\\A Ao \\ F ^ 



andhence^^K 9 1 + 



1 + e 



XB^Jm(\A^\ oS Vl) 



111 




A- n 



C(P(^))) 



2 viMAo)) 




0,off 



V 1 



in 



at a faster ( or comparable) rate in terms of l\ norm in average than that for the off-diagonal entries when 

\Sa \ < rn (or whenjS^J x m). 



E An outline for convergence in the operator and the Frobenius norm 



We need some auxiliary results for proving Theorems A. 1 and A. 2. Throughout this section, we assume that 
event Xq as defined in Lemma 4.2 holds. It is clear from the proof of Theorem 4.1 that T{Aq) n T(Bq) 
holds on event Xq, and thus all statements in Theorem 4.3 hold on Xq. Let W\ and W2 be as defined in (7a) 
and (7b). Claim E.l provides the large deviation bounds on the spectral norm for estimating W\ and W2 
using W\ and W2. 

Claim E.l. Denote by f3 n := ^n^A an d ct n := t ™ Ao y where tq and Tq are as defined in Theorem 4.1. 



Let 1 > Xa > 3a n and 1 > Xb > 3/3 n . Let /Ix{Bq) = diag(an, . . . 



diag(6 



11: 



5 //J 



Then on Xq, we have for f3' r> 



< 



~$ and a 'n 



mm, 

y/1-Q.n 



Wx - Wi 



-1 



-1 



and 



W2-W2 



1! 



< /3' n /y/tT(Bo)y/a m in, 
max • 
min- 



and W 2 /tr(Ao) 

A4„ 



< 



V6 



We next state the following convenient results. 

Lemma E.2. Let \\-\\ be a matrix norm which satisfies the triangle inequality and is multiplicative with 
respect to the Kronecker product: \\A (g) £?|| = ||A|| Then for A A := A\ — A and Ab := B\ — B, 



\A 1 ®B 1 -A®B\\ < \\A A \\ \\B\\ + ||A B || \\A\\ + ||A, 



^B 
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It is clear from Lemma E.2 that the intermediate results in Lemma E.3 are useful in bounding the operator 
norm and the Frobenius norm on A and A' to be defined in (60) and (61). 

Lemma E.3. Suppose (Al) and (A2) hold. Let a n < A^ /3 and f3 n < X Bo /3 where Xa , X Bo < 1- We have 
on Xofor some absolute constants 9 < C, C < 18, 



WtAWx/triBo) -A < 2CX Bo a ma ^ ( P (A Q )f J\A^ l \ off V 1, 



and 



W 2 BW 2 /tx(A ) -B < 2C'X Ao b miiX K(p(B )y J \B^ | Q off V 1; 



and for some 5 < C, C < 19/2, 



tr(S ) [WiAWi 



and 



tr(A ) W 2 BW 2 



An 



B n 



< 



< 



2CX Bo J\A 1 \ Q ^ {i V 1 



a m mf 2 min (p(A )) 



2C'Xa J B t 



V 1 



0,off 



V 1 



bminfi in {p{B )) 



On Xo,for some absolute constants 18 > C,C > 9, 



WiAWiMBq) - A f < 2CX Bo a max n {p(A )) z A /|A 



-i 

lo,off 



V m, 



and 



W 2 BW 2 /tv(A ) -B < 2C'X Ao b ma ^K (p(B )) 2 a/IVL off v 



and for some 19/2 > C,C > 5, 



tr(B ) W X AW X 



and 



tr(A ) W 2 BW 2 - B, 



-i 



-i 



A, 



-i 



< 



< 



2CX Bo J\A 1 



0,off 



V m 



amm<p 2 min (p(A )) 



2C ' X A J\B 1 \ s V / 



&min<£mh>( 5 0)) 



Let us denote by A and A' the following: 
A : 



WiAWi ® W 2 BW 2 /(tr(Bo)tr(A )) - A ® B , 
tr(B )tr(A ) (WxAwX 1 <g> (w 2 BW 2 )~ l - A^ 1 ® Bjf 1 . 



(60) 
(61) 



Lemma E.4. Let \\-\\ be a matrix norm which satisfies the triangle inequality and is multiplicative with 
respect to the Kronecker product: \\A® B\\ = ||A|| ||I?||. Then for A® B as defined in (8), we have for 
S = A ® B , 



A®B 
A®B - S 



y-l 

^0 



< (a n A /3 n ) HAq 1 !! llB^H + (1 + a n A /3 n ) ||A'| 



(62) 



< AAo * Poll HBoll + (1 + AAo S ) II A|| • (63) 
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F Proofs for the rates of convergence for the Gemini estimators 



F. 1 Proof of Theorem A. 1 



Throughout this proof, we assume that event Xq as defined in Lemma 4.2 holds. First we bound ||A'|| 2 
using Lemmas E.2 and E.3 as follows. Let C, C be some absolute constants such that 19/2 > C,C > 5, 
and W\ and W2 be as defined in (7a) and (7b) respectively. 



Then on Xq, 
llA'l 



< 



/ W X AW\ 
^ tr(So) 



A' 1 



+ 



r?— 1 1 1 1 II 4-1-11 
n o II2 ' ll^-o lb 

tr(B ) (w^A^Wi 1 ^ - A 1 ^ tr(A ) (w^B^W^ - Bq 1 



W 2 BW 2 \ 1 _ , 



< 



2CAs oA /|-4o 1 !oofr v 1 2C ' x a J\B ( 



0,off 



V 1 



+ - 



(p(A ))<p min (B ) (p(B ))(p m in(A 

4CC'Xa Xb 



+ 



(64) 



An 



VU \B, 



VI, 



and for || A'|| 2 bounded as in (64), we have by (52), (52), (42a), and (42b) 



{a n A p n ) II A' 1 1 < 



4CrC "^o^Bo\/|A) 1 |o,ofiE V 1 \/\ B 



3-1 



O.off 



V 1 



3a m in6min^mi n (p(^o))^i n (p(-Bo)) 
Vmin Vmin 



H /, = =+ X A A X Bo 



< 



8CC Xa Xbo 

5d m i n b m [ n 



A 



-1 



0,off 



VI,/ B, 



0,off 



V 1 



(65) 



The bound in the theorem statement for estimating Aq 1 (g) B^ 1 thus holds by inserting (64) and (65) in (62). 
Next we bound the error in the operator norm for estimating Aq <g) Bo. Let 9 < C, C < 18. On Xq, we have 
by Lemma E.2 and Lemma E.3, 



l|A|| s 



< 



Wi 



A 



(. 



(. 



Wo 



B 



(. 



Wo 



V^iBo)) \V^(Bo)J \V^(^)) V\/trOV) 



A ®B 



WyAWx 



tv(B 



A ( 



I-B0II2 + 114)1 



W 2 BW 2 



tr(A ) 



Be 



+ 



W X AW X 



tr(B ) 



A ( 



W 2 BW 2 



tr(A ) 



Br 



< 2CX Bo a max \\Bq\\ 2 k (p(Ao)) 2 x /\Aq \ V 1 + 2C'X Ao b max \\A \\ 2 n (p(B )) 2 \/\Bo\ oS V 1 



(p(A )) 2 K (p(B )) 2 J\A^\ 0oS V lJ\B^\ 0olI Vl, 
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and hence by (52), (52), and (42e), 
Xa A Xb 



|| A|| 2 < 2CC'X Ao X Bo a mabX b max K {p(A )Y k (p(5 )) + 



-ll 



< 



^0 lo,off 

20CC 



vi, /LB 



r 1 ! 

'0 lo.off 



V 



l) x + ^7 + ( A ^o A X Bo )) 



9 



A^ABoOmaxftmaxt 0(A))) 2 « (p(Bo)) 2 yj\ A Q 1 | 0off V 1 ^/j # 1 V 



The theorem is thus proved by inserting the bounds immediately above in (63). □ 



E2 Proof of Theorem A.2 



Throughout this proof, we assume that event Xq as defined in Lemma 4.2 holds. We first obtain the rate 
of convergence in the Frobenius norm for estimating Aq ® Bq. First we bound || A\\ F using Lemma E.2 and 
Lemma E.3 as follows: for 9 < C, C < 18, 



\\A\\ F <\\B \\ F 



WiAWi 



tr(B ) 



Ac 



w 2 bw 2 



tr(Ao) 



B r 



+ 



WxAWi 



tr{B 



A ( 



W 2 BW 2 



tr(A 



Be 



< 2CA Bo a max H^oII^k^o)) 2 Vl^Lff Vm + 2C "Wmax ||4)|| F « (p(B )f Vl 5 o ^ooff V / 



(p(^o)) 2 « (p(B )) 2 v I A o 1 lo, off v m y\ B o\ oS v /• 

Hence for ||5 || F < V? ||B || 2 , and Po|| F < Po|| 2 , we have by (42e), (52), and (52), 
XAo * II A Hf < 2CC'X AQ X Bo ^b max K (p(A )) 2 k ( P (B )) 2 ■ 

U\ A o\ oS Vrn^/\B \ o{{ Vf) x (± + _L + (A A() A A flo )) 



< 



20CC" 



9 



AA AB a max 6 max K (p(^ )) 2 k (p{B )) 2 ^\A Q 1 \ 0oS V myj\B Q V /. 



The bound on 



A <g> £0 - Aq <g> B 



thus holds by inserting the bounds above in (63). We next obtain the 



rate of convergence in the Frobenius norm for estimating A Q 1 ® B Q 1 . Let 19/2 > C, C" > 5. We bound 
|| A'Uji using Lemma E.2 and E.3 as follows, 



fWiAWi\ 1 
^ tr(5 ) J 



4" 1 



+ llA) NIf 



tr(^o) J 



+ 



WiAWi/ti(B 



-1 



.4, 



-1 



W2BW2 MAq) - B, 



-1 



2CA Bo ||B J\A 1 \ QoS V m 2C"Aa \\A 1 \\ f J\B \ ff V / 

< * - . - . . ' + ; o rTTTT^ ' + 



+- 



aminV 2 nin (p(^0)) 

4CC'A Ao A Bo 



bmm(p 2 miQ {p{B )) 



<2min"min 



.4, 



-II 

lo.off 



Vm\ \B. 



r 1 ! 

^0 lo.off 



v/. 
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And thus by (42a), (42b), (52), and (52) 



2CX Bo Xa J | A 1 1 o off v 



Xa A X Bo ii^ii < 

3 3<2 m i n o m i n (/? m j n 

(P(^b)) 



+ 



2C / A A() A B() J|V| 0off V/ v / ^ 



3flmin0min ^min(P( jB o))¥'mm(p(vlo)) 



+ 



Xa A Ab 



4CC'A Ao A So 



-1, 



-ll 



< 



a min 6 m in< in (p(Ao))^ in (p(i?o))V l lo^ 
8CC'Xa Xb 



Vm, B, 



-li 
'0 lo.off 



v/ 



A. 



5a m i n 6 min ^ in (p(^o))^ i n(p(^o))V 1 ^ '»•"«■ 
The theorem is thus proved by putting things together in (62). □ 



E3 Proof of Corollary A.3 

Throughout this proof, we assume that event Xq as defined in Lemma 4.2 holds. We first note that the 



bounds on 



Bif — 



and 



b: 



b- 1 



follow from Lemma E.3 immediately: forW^ = yj tr(Ao)diag(i?o) 1 ^ 2 



and W2 as defined in (7b) 

B* — B* 



W 2 BW 2 - tr(A )B 



tr(Ap) 



W 2 BW 2 /ti(A ) - B Q 



We can now plug in the bounds on the operator and the Frobenius norm from Lemma E.3. The proof for A* 
is long and monotonous, and hence omitted. □ 



F.4 Proof of Claim E.l 

We note that the following bounds Mi = 1, . . . , m hold on Xq, for /3 := /3 n , 

2 



1 ££=1 x(t) 



tr(flo) 



< 



aatr(Bo) 



1 < 



/r ° :=/3 and 



max 
i=l,...,m 



/ a ii tr(S ) 

Thus the first inequality holds. Similarly, on Xq, we obtain 



tr(flo) 

< (1 - x/w) v (y/i + p - 1) < p. 



Vi = 1, . . . , m, 



< 



< 



|2 " 



and hence 



v / a ll tr( J B ) 



< 



V 



< 



/3 
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Thus the second inequality also holds. The statements about W2 and W 2 also hold following the same line 
of arguments. □ 



F.5 Proof of Lemma E.2 

We have by distributivity of the Kronecker product, 

AiOBi = (A + A A )®(B + A B ) = A®(B + A B ) + A A ®(B + A B ) 
= + A B + A A (g> B + A A (g> A B , 

hence by the triangle inequality, we have 

pi <& B x - A® B\ = \\A® A B + A A ® B + A^ ® A B || 
< p® Afl|| + HA^OSII + ||A A (g) A B || 
= ||A B || Pll + \\A A \\ \\B\\ + \\A A \\ \\A B \\ , 

where the last step follows from the multiplicativity of the norm with respect to the Kronecker product. □ 



F.6 Proof of Lemma E.3 

In the following proofs, we will only derive the bounds for estimating Aq and Aq , as the bounds for Bo 
and Bq 1 are derived in exactly the same manner, and hence omitted. We need the following proposition. 
Proposition El. Let W and W be diagonal positive definite matrices. Let \P and \£ be symmetric positive 
definite matrices. Then 



< 



< 



+ 



+ 



w-w 
w-w 

w-w 
w-w 



+ IIWI 



w-w 

2 



+ 2 



1^1 



+ w 



2 

$ - * 



w-w 



F 

+ 2) 11*1 



F ■ 



Proof 'of Lemma E.3. Throughout this proof, we assume that event Xq as defined in Lemma 4.2 holds. 
Let diagonal matrice W := W\j \J\x{B§) and W := W\j \J\x{B§). By Proposition El, Claim E.l, and 
Theorem 4.3, we have for f3 := f3 n < X Bo /3 < 1/3 and 18 > C > 9, 



WiAWxMBo) - A, 



WAW - diag(^ ) 1/2 p(^o)diag(A 



,1/2 



< (l + /3) 2 a max A-p(A ) +(/3 2 + 2/3) 

"max 



< 2C\ Bo a max K(p(A )f 



A lo.off V 
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where we used the fact that k(p(Aq)) > \\p(Ao) || 2 > 1, given that ¥?min(p(A})) < 1 as shown in (52); and 
WiAWiMBq) - A < (l + /3) 2 a max A-p(A ) + (/3 2 + 2/3)a max \\p(A )\\ F 

F F 



< 2CX Bo 

off V m ' 



Similarly, for /3' := /3 n /i/l — f3 n < Xb /VG, where f3 n < 1/3, we have by Proposition F.l, Claim E.l, and 
Theorem 4.3, 



/ WxAWi \ 1 
^ tr(5 ) ) 



r\2 



< 



(! + /?') 



A-'-piAor 1 



+ — — — m A o) h 

z "mm 



w x aw x 

tr(5 ) 



-i 



A' 



< (2C + l)\ Boy /\Ao\ oS V l/(a minV 9 2 nin ( /3 (A ))) 
I- 1 - p(Ao)- 1 



< 



(1 + /3') 2 



F + — — llA A o) \\ F 

r u min 



< (2C + l)A Bo ^jVj^ V ^/(a^^iJp^o))), 
where 9/2 < C < 9 and hence the statement in the Lemma holds for5<C<19/2. □ 



Proof of Proposition F.l. By the triangle inequality, we have 
WVW - WW 

(W - W)$(W -W) + WV(W -W) + (W- W)WW + W($ - V)W 



< 



w-w 



+ 2 



w-w 



\\W\\ 2 ||*|| 2 + 



w-w 



+ 



f - f 



Similarly, we have 

W^W - WVW 

2 



< 



< 



w - 


w 


w - 


w 



F 














$ 


+ 2 

F 


W-W 


2 W\\ 2 




r 


$ - f 



W-W 



+ 2 



+ 



W-W 



+ 



F 

$ _ q. 



□ 



F.7 Proof of Lemma E.4 

By the triangle inequality and the multiplicativity of the norm with respect to the Kronecker product, (60), 
and (61) 



W^A^Wf 1 ) <8> [W 2 l B~ l W 2 l 



tr(A))tr(5 ) 
\ A Q 

(WtAWt) (W 2 BW 2 )/ti(A )tr(B ) < \\A \\ \\B \\ + ||A|| . 



< IIVIIM + IN 



(66) 
(67) 
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First following the large deviation bounds in Theorem 4.1 (see also Remark C.2), we have for Xa > 3a r 

and X Bo > 3/3 ft , where a n A n < Xa ° a 3 Xb \ and by (18), 

iEr=ill^(*)llF-tr(A )tr(Bo 



£E£=iII*(*)IIf tr(^ )tr( J B ) 

(a n A /3 n ) 



< 



< 



thus 



l nT:u\\m\\ 2 F 

Xa A A# 
2tr(A))tr(B ) 
tr(A))tr(S ) 



< 



iE2=ill^(*)llF-tr(^Q)tr(Bb) 

On A P n 



tT(A )tv(B )(l-a n AP r , 



1 



< 



a n A ffn 
1 - a n A /?„ 



■Er=iiw)iiF 

By the triangle inequality, definition of A, (67), and (69), we have 

A®~B -A (g)Bo 



< 



Xa A Xb 



< 



1 



1 



WiAWx 



< 



+ 

Xa A Xb 



P || \\B \\ + (1+ Xao/ " Xb ° )\\A\\. 



JE£=iIW)IIf tr(A))tr(Bb) 

WiJWi) ® fe-BW^) /tr(A )tr(S ) - A ® 5 



w 2 bw 2 



(68) 
(69) 



and by definition of A', (18), and (66), 
-l 



A « B 



< 



(£ E?=i ll*(t)|&) (wr^wr 1 ) ® - 1 ^ b 1 



tr(i4o)tr(S ) ( Wf A A- A Wf A ) ® ( W^-B^Wjf 1 



/T 1 



B, 



-i 



< (a n A/3 n ) {\\A G 



+ iiA'in + iiA'i 



G Proofs for the Flip-Flop methods 

We need Lemma G.l to define event £q, which we will use for proving Theorem 6.2 and Theorem 6.6. 
Lemma G.l is a corollary of Theorem C.l, where we also denote the entry- wise error for approximating 
p(A ) and p(B ) with diag(A)" 1/2 l( J B,)diag(A)" 1/2 and diag( J B*)- 1 / 2 J B(A)diag(S*)- 1 / 2 by A/,„ 
and X m ^ n respectively. These rates vary slightly depending on the sample size n. We use this notation 
throughout the rest of this section. 
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Lemma G.l. Suppose that n < log(m V /). Let d = 1 — (c/2 A 1/2), where c = logl o°^ v j) 

log d max(m, /) 



. On event £q, 



diag(A)~ 1/2 A(^)diag(A)^ 1/2 - p(A ) 
diag(^)" 1/2 5(^)diag(^)" 1/2 - p(S ) 



< 20- 



IJn 



log d max(m, /) 

< 20 ;= =: X m ,n 



'mn 



where P (£ ) > 1 r—FZ- 

v u/ — max(m,/J^ 

Otherwise, let n > A(C\k? + C2K) logmax(m, /), where k > 0. 77ie«, o« everc? £0, 



diag(^)- 1 / 2 ^( J B,)diag(^)- 1 / 2 - p(A 
log 1 / 2 max(m, /) 



< 2 VCi + C 2 A 



diag(^)- 1 / 2 J B(^)diag(^)- 1 / 2 - /5(A) 
< 2yC^^ lQgl/2max( ^ /) =: A m , ra 



where C x = -#= » 5.0088, and C 2 = \/8e « 7.6885. 

V 07T 



G.l Proof of Lemma 6.1 

In order to prove Lemma 6. 1, we need the following auxiliary results. 

Claim G.2. Suppose all conditions in Lemma 6.1 hold. Let B\ and B be as defined therein. Let Bq 
B\ — B*, where B* = tr ^°) Bq. Then we have on Xq, 



K A 



B- 1 


a 


B- 1 




i,off 1 — a 





< tT(BoB^) 



< Aa 



B 



-1 



+ 



O: 



Loff 1 — a 



B 



-1 



=: fV>- 



Corollary G.3. Suppose all conditions in Claim G.2 hold and let Xa = 2a/e(l — a), where < e < 2/3. 
Then we have on Xq, 



tr 



(BqB^)\ /f <J1<^ = \p(B Q y\/f + X Ao \p(B )-\ oS // + o(X Ao ) 



Proof of Lemma 6.1. Let 



Ra 



vec 



{Si 1 }...vec{si f }...vec{s«} 



Ra = [hi vec { A } . . . bif vec { A } . . . bg vec { A }] = vec { A } (g> (vec { Bq } ) T 



In order to obtain a bound on 



A(Bi) — A* , we first write vec < A(Bi) — A* > as follows: 

max L > 

vec { A(Bi) - A* } = i(R A - R A )vec { B" 1 } + ±R A vec { B^ 1 - B" 1 } 



(70) 
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where in (70) we use the following equation 

jiiAvecjB^ 1 } = = jvec{A } <g> (vec { Bq }) T vec 



tr(A ) 



B, 



771 



tr(A c 



-vec { A } i(vec { Bj }) T vec { B^ 1 } = vec{ A, } . 



In order to proceed, we now compute A(B\) as defined in (25), where an explicit formula for B 1 1 is needed. 

For B := B x - B*, we have A := 1 - B^ 1 = -B^ 1 (Bi - B^B' 1 = -B^BqB' 1 and thus 

i^vecjB^ 1 -B; 1 } = R A {-B^\B 1 -B^)B- 1 ) = R A vec{-BY 1 B B: 1 } 

= R A vec { -B^BqB; 1 } + (R A - R A )vec { A } . 

We now insert the equation immediately above in (70) to obtain 

vec { A(Bi) - A* } = ±(R A - R A )vec { B" 1 } 

+jR A vec { -B^^oB- 1 } + }(R A - R A )vec { A } 



=: U 1 + U 2 + U 3 , 



(71) 



where the matrix correspondent of each summand will be denoted by Mi, Mi, and M3 respectively. Now 
for the first summand on the RHS, we have 



U 



1 = j(R A — R A )vec { B,: 1 } = vec j A(B*) — A* | 



and Mi = A(B*) — A* = i E{=i Yl{=\ Sn(B* x )kt — A*. By Lemma G.l, we have on event So, 



|Mi. 



Aij(B^) — a* 



■1 j 



< 



We now examine the second summand on the RHS of (71), where recall that Bq = B\ — B*. 

R A vec { B^BoB; 1 } = vec{A }vec{Bj} T (B; T »B^ 1 )vec{B } 

= vec{A*}tr(B ] ; 1 Bo). 

Then clearly on Xq, 

U 2 = jRAvecl-B^BoB- 1 } = ±vec { A* } tr(-B^B ), 
and M 2 = A^tr^f 1 ^)- 
By Claim G.2 we have on event Xq, 

tv(B B^) 



\M- 



< la 



*M I 



/ 



(72) 



+ _9L_1 



B- 



I I 
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Finally, we bound the third summand. Let A := B x 1 — 1 , 

U 3 := j(R A - R A )vec { A } = \ ^Li Ej=i ™c { Sn j - b kj A } A° k , 

and M 3 := ) (£{ =1 EJ=i ^A° fe - tr(5 A )Ao) - 
Define event £\ as 

idiag^o)-^ ( E { =1 E / =1 A° .S$) diag(^o)" 1 / 2 - tr ^p(A c 



< D'X 



where 



D' 



Bl /2 A°B l /2 



and by Theorem C.l, we have F (£i\Xq) > 1 — ^^yp ■ To see this, we use the following bound as shown 
in Corollary A.3 on event Xq, 



b- 1 - B: 



< 



B, 



-H 



I 



2C"Aa 



V / hence under (Al) 



jB o 1 !o.ofr + V 7 / 



"I ^ J +0(X Ao ) = o(l). 



Thus we have under event £\ n Ao, under (A2), 



-tr(A c 



77? 



-D'Xf in — y/a* t aa*jjO (Xf, n ) ■ 



We write the matrix correspondent of equation (71), under event Xq n £q H £\, 



A(fli) - A, 



< 



A(B*) - A, 



+ 



tr( J B 0J B; 



/ 



+ |M 3 , 



A/, n (l + o(l)) + | a, 



where by (72), /x = Aa -B 1 I f + B 1 // is upper bounded by /x = Aa 4 Ip(-Bo) 1 |i off + 

l,off 1 ^ ' 

jz^j \p(Bo)~~ 1 \ 1 + o(Xa ) as shown in Corollary G. 3. The lemma is thus proved. □ 

It remains to prove Claim G.2 and Corollary G.3. 

Proof of Claim G2. Throughout this proof, we assume that event Xq holds, and Aa > fz^j- By 
definition, for T(Bq) be defined as in (6b), 



B(I) 



^ m 1 / n 



fc=i 



f 



W 2 T(Bo)W 2 
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where W 2 = diag(5(I)) 1 / 2 y 0. We have by the KKT conditions, 

Bij - f -(B )| < A A) , V^ 1 = ( hence B~j. = 0) 
Bij - TijiBo) = X Ao , V^ 1 > ( hence B^ > 0) 
and - ry(S ) = -A Ao Vi?^ 1 < ( hence B^ < 0) 

where B x = ^W 2 BW 2 = W 2 BW 2 , and thus 

Vt, j B lfij - B i:i {I) = W 2>ii (Bij - f -(S )) W 2tjj 

( 



if i = j 

if > 
if B7l < 



, G [-W r 2,«A > i W2j J -,W r 2,«A jlo W2 iii ] if B-y = 
tr ((Si - B(I))B^) =J2^M X A W 2>jj B 



Thus we have 



I-'./ 



£ w 2^a w 2 , J3 :u, '/};/! I 2 ./ ; = a Ao 



S" 1 



Loff 



The claim is proved once we show that 

tr ((B(I) - B*)B^ X 



< 



B 



-i 



l 1 — a 

Indeed, for B = B 1 - B* = {B x - B{I)) + (B(I) - B*), we have 



Aa 



Q: 



i,off I — a 



S(I)- 1 ^tr^oBr 1 ) 



tr (Si - B(I))B^ L + tr (B(J) - 



>-i 



< A^o 



+ 



a 



i,off 1 — a 



5" 1 



It remains to show (73). 

Before we continue, we need the following inequalities. We have on by Lemma 4.2 for a := a r , 



1 /6u6^tr(^4 ) 



Pij( B o) 



and Vi = 1, . . . , / 



and hence, for all i, 



1 



and Vi, 



1 + a 

a 
1 + a 



< 



biitT(Ao) 
6iitr(A ) 



< a, 



< a, 



W? 1 ^ n 



6iitr(^ ) < 1 



2,ii n 



1 — a' 



> 1 



1 



6jjtr(A ) 



1 v^n 

it 



> 



-a 



Et =1 \\y(t)f 2 1 ~ a 
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Thus we have for ^ < s/b^uKjj / (W 2: uW 2 ,jj) < j^, 



£5 

/ 



^biibjjti(Ao) 
a 



Pij(Bo) 



JJ 



Wo U W 2 



JJ 



(74a) 



i,off (1 — a) 
6* 



, and hence 



< 



a 



1 — a 



diag^- 1 ) 



(74b) 



For B := B(I) and &»,«/W| (ii = tr^&K/Wf^, we have 

tr ((£ - B*)B^ 

= tr ((w 2 T(B )W 2 - di ag (^) 1 /2 p ( Bo )diag 



tr (r^o)^- 1 ) - £ ^(£0)5: 



.7.7 



\fb* t u \Jb*jj 



W 2 ,u W 2Jj 

f 



W 2M W 2 



+ E% X 1 



jj 



b*,ii 



*7^J 



y/bubjjtr(A ) 



Pij( B o) 



Vb* 



w 2M w 2 



■JJ 



Clearly (73) holds by taking the absolute values on both sides of the last equation, applying the triangle 
inequality, and then plugging in the inequalities (74a) and (74b). □ 

Proof of Corollary G.3. Throughout this proof, we assume that event X§ holds. Let A# := B^ 1 — 
p{Bq)~ 1 . For \a q = 2a n /e(l - a n ), where < e < 2/3, we have by (Al) and Corollary 4.4, 

IdiagCAflJl! < Vf\\A Bo \\ F < jf- ^ . 1 (p(^ )) A ^ Vl^lo.off V 1 = o (v?) 

|A B0 | liOff < ^l^'lo.offl^ (9^0^ \off V l)/(vLn(p(S ))) 

'icoff (i 1 - e e )£ 1 X Wi^u v 



< 



-11 
J o lo.ofr 



where (9A Ao J|^o 1 | ,aff V 1 )/(v 7 min(/°( B o))) = o(l) by (Al) and (A2), and is a constant so long 



as 



e is bounded away from and 1. Thus for J\ B Q | 0off + ///<!, we have 



} IAsJ, = o (1/V7) + o WlVlo.off = ( VI Vlo^fl + / = °(!)- 
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Thus by the triangle inequality, we have for Xa x ot n /{l — a n ), a := a n , and e < 2/3 

>-i 



l.off 



^ \p( b o) 1 |ioff+ ° 



B, 



Loflfj ' 



B 

$~\< \pW\ + l A Boli < \p{Bo)-\ + o y\Bo\ oS + / 
and hence fx < \p(B Q )~ 1 \ 1 / f + X Ao |pCB ) _1 | l off // + o (X Ao ) . 

The corollary thus holds, n 



G.2 Proof of Theorem 6.2 

First we observe that (34) follows from (33) immediately given that rj = A/ jn (l + o(l)) + ju < r\. We now 
show that (33) follows from (30). Indeed, we have for all i,j,aa event A\, 



r\(Ao) - Pij(A ) 



A ij (B 1 ) 
A]/ 2 (B 1 )A]! j 2 (B 1 : 



Pij( A o) 



(A]l\B 1 )/^)(A^(B 1 )/^a—) 



Pij( A o) 



{A\l\B 1 )/^){A^(B l )/^) 
< Am(1 + o(1)) + /I|^(^o)| +| ^ w| 



1/2, 



1 — 77 
A />n (l + o(l)) 



1 — 77 



1 — 7? 

A /in (l + o(l)) + 2£ 



1 — 77 
A /iW (1 + q(1)) 

1 — 7/ " '' 1 ;) 

2 2?? 

— (A />n (i + o(i)) + MA))| m) < ^ 



where we used the fact that on Ai, by (30), 

A{B X ) 



Vi, 

and Vi / j, 



1 



< 77 and hence 



^(£i)f 



A(Bx) 



< A/, n (l + o(l)) + 



I I 



jJL. □ 



G.3 Proof of Corollary 6.4 

Throughout this proof, we assume that event Ai holds and m < /. Clearly event T(-Ao) holds for sample 
correlation matrix T(Aq) for <5 n j = < ^ on event .Ai, where 5j in w |Aq 1 | Q off V 1 = o(l) 
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by (Al). We have by Theorem 4.3, on event A\, 



A(Si)" 1 - p^o)" 1 p < 9(1 + e)X Bl \/\A 1 \ 0oS V l/(2^ in (p(A ))) 5 



and 



A(B 1 ) - p(A ) < 9(1 + e)K(p(A )) 2 \ Bl J\A 1 \ 0oS V 1. 



Let Wi = diag(^(5i)) 1 / 2 and W = diag(A*) 1/2 = diag( v ^TT, . . . ^/a^, 



tr(A ) 



(75) 
(76) 

diag(A)) 1/2 . 



We have for all i, by (30), 

Wt-W 



W 1M ~ a *,a 



W^ 1 - W~ l 



< a* t ur]. Then the following holds on A\, 

< (y/l+rj- 1) V (1 - v 7 ! - %al%3x < ^a*, max fj, 

/ yrr^-i l- yr^?\ i 



(77) 



< 



V 



a/1 + r) y'l — rj J y/a*,mh 



< 



— 7j -y/<2*,min 

By Proposition F.l, (76), and (77), and for rj < A Bl (l - rj)/2, rj < 1/4, and 18 > C > 9, 
1* - A* = W\A(B\jW\ - diag(^) 1/2 p(^)diag(A,) 1 / 2 || 2 

< (rj + 2)rja* >max \\p(A )\\ 2 + CX Bi k (p(A )f ^\A^ \^ V la*, max (l + 
(l_^+2) 



(78) 



A 



Bi" 



||/>(A))|| 2 + CA Bl K(p(^o)) 2 v/l^O 'lo.off V la *,max(l + »50 



< 2CA Bl a* imax K (p(A )) 2 V \A 1 \ oS V 1 and 



A* - A* < X B ||p(A )|| 2 + CA Bl a,, max K (p(A )) 2 ^\A \ oS V 1(1 + rj) 

< 2Ca* imax A Bl K (p(A))) 2 y / Vo 1 | ,off v m - 
Similarly, by Proposition F.l, (75), and (78), we have for rj < rj < 1/4, and 9 > C > 9/2, 



a- 1 - a: 



W^AiB^W^ 1 - diag(^)^ 1 / 2 ^- 1 diag(A 



-1/2 



< 



„ f _x CX Bl J l^o^ooff v 1 

-^-^ (jj + 2 VI " V) /(Vmin(p(A)))a*,min) + o 7 7~, I ' + 



l P 2 am(p( A 0)) a *,- 



< 2CA Bl ^|V| ,off V l/(^ n (p(A ))a», min ) and 

Ir 1 - a- 1 



A 



if •" 



< 2C7A Bi\/!V!o,off Vm/{ip 2 min (p(A ))a, Mn ). □ 



G.4 Proof of Theorem 6.6 and Lemma 6.5 

Proof of Theorem 6.6. Given Lemma 6.5, the proof follows exactly the same arguments as Theo- 
rem 6.2, and hence is omitted. □ 
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Proof of Lemma 6.5. Let 



Rb 



vec 



{S 1 \ 1 }...vec{s^}...vec{s7 n }~ 
Rb = [anvec{B } . . .ai m vec{ B } . . .a mm vec{ B }] = vec{B } <8> (vec { }) t . 

Let A\ = A* For convenience, we write A 1 = A^ 1 — A^ 1 = A~ l — A^ 1 and A\ := A\ — A* We have 



First we write 



A 1 = A' 1 - A' 1 = -A-\A l - A*)A~ l = -A~ 1 A 1 A~ 1 . (79) 

^vec { B(Ai) } - ^R B vec { A^ 1 } = ^R B vec { A^ 1 } - vec { B* } 

= ±(Rb- R B )vec { A" 1 } + ^R B vec { A 1 } + ^(R B - R B )vec { A 1 } 

:= V 1 + V 2 + V 3 , (80) 

where the matrix correspondent of each summand will be denoted by Ml, M2, and M3 respectively. Now 
for the first summand on the RHS, we have 



Vi = ^{R B -RB)yec{A- 1 } =vec{B(A,)-B,} 
= 7 £;=l vec { § n j " a kj B } A- 1 (jk) 

and Mi = B(A.) - B* = ± £{ =1 EL - B*. 

By Lemma G.l, we have on event £q, 

I -^1,27 1 — Bij(A*^ B^ij 'yb*^iifo*ijj ^m,n* 

We now examine the second summand on the RHS of (80), where recall that A\ = A* — A*. Now by (79), 

V 2 = ^vec { -A^AiA- 1 } = i-vec{B }vec{AT} T (A- T ®Ar 1 )vec{-A 1 } 
= i- V ec{Bo}tr(-A Ar 1 A 1 A- 1 ) = ^vec { B* } tr(— 1 Ai). 

Hence M 2 = S*tr(— A^ 1 A\) /m. From Claim G.4, we have on event A\, 



\M 2 ,ij\ = \\x(-A x Al l )\/m< \b* M \ (\ Bl 



AT 1 



+ 



A 



l,off 1 — 77 

Finally, we bound the third summand. Let A 1 be as in (79). By definition, 

Vs = ^ (Rb - R B )vec { A 1 } = ± ££Li ££1 ™ { §? - a kj B } Aj 



\Kij\€- 



jk 



and M 3 = i ET=i ET=l Sn J ^) 



jk 



trjApA 1 ) 



Bq. Define event £2 as 



where D4 



^ /2 A^ /2 



Vv^ < ||A || 2 \\A^ - A^\\ F /^i = o(l). 
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Then under event A\, we have by proof of Theorem C.l, P {£2\A\) > 1 — ( mV /)2 ■ To see this, we have by 
Corollary 6.4, on event Ai, 

D4 < -^\\Ao\\ 2 \\A^ -A' 1 ]] 

< ^= ll^oll2 2C ' A Bi\/|^0 1 |o,off V W( a *,min¥>Ln(p(A)))) 
1 . ^ . , 1 



m 



A WK lnofr + A Si =°(^=) + °( A Bi) = °( 1 ) 



m 



where note that under (Al), (A2) and (A3), we have for < e\ < 1, 
We have under event £2 H »4i, 



A/, n + A m ,„ = 0(A Bo ) -> and A Bl w|^ x | 0off = o(l). 



— y/bubjjD^Xm^n — ^fbif^ib^ jj sD^X m n — ^/b^^b^jjO (X m ,n) 



3,ij\ 



' JJ tr(A y 

The lemma is thus proved by plugging in the bounds on l-Mi^l, lil^yl, and l-M^yl. On iifl^, we have 



(B{A X ) 





< 







B(A*) - B* 





+ 







6* 



tr{A x A-^) 



??? 



+ |M 3 . 



y 



< yJb*,iiKjjX m ^ n (l + o(l)) + 1 6* ,17- 1 £ 



by (80), where £ = A Bl 



A- 1 



l.oflf 



/m+^= yl 1 /m is bounded by £ as shown in Corollary G.5. □ 



In order to prove Lemma 6.5, we need the following auxiliary results. 

Claim G.4. Suppose all conditions in Lemma 6.5 hold. Let W\ = diag(l(i?i)) 1 / 2 . Let A\ = A* := 
W\A[B\)W\, where A(B X ) is as obtained in Step 2 with A Bl chosen to be lower bounded by 2r//(l — 2rf) 
where rj := Xt n (l + o(l)) + ju. Let A\ := A\ — A*. Then we have on A\for A := A(B{), 



Xbi 



A' 1 


T] 


A' 1 


< 




l,off 1 — 77 




1 



tr^i^ 1 ) < X Bl 



A 



-1 



+ 



>1 



l.oflf 1 — 77 



A' 



(81) 



Corollary G.5. Suppose all conditions in Claim G.4 hold. Let < e\ < 1, and 

2rj 



O(Xb ) where 77 = A/, n (l + o(l)) + /x 



ei(l-JT) 

is as defined as in (34). Clearly = ^A Bl < A Bl /3. Then we have on Ai, for A := .A(2?i), 

m 1 — 77 777 



|trQM^)| 
777 



< e<A Bl 



^ + o(A Bl )<e 



where £ is as defined in (37). 
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Proof of Claim G.4. Throughout this proof, we assume that event A\ holds. On A\, we have by (30), 



Vi + j, 



(Bi) 



and hence 



and Vi, 1 — 77 < — — < 1 + 77, 

1 y/O^ii 1 



< \ f , n (l + o(l) )+ -jgp /2, (82a) 



(82b) 



as diag(A(5i)) = W?, and W^/a^u - 1 < 77 = A/, n (l + o(l)) + /j < 77. By the KKT conditions we 
obtain for A x = A* = and A(B X ) = WiT(A )Wi, where Wi >- 0, 

Aij(Bi) - r i3 (A )| < A Bl , Vl^ 1 ^) = ( hence A~] 3 = 0) 
A ij (B 1 ) - r\(A ) = X Bl , VAr 1 ^) > ( hence A^. > 0) 
and ^-(Sx) -fij(Ao) = 

and hence for all i, j, 



A Bl VA-.^Si) < ( hence A^. < 0), 



Ax,ij - Ay (Si) = Ui^Bi) - ^(Ao) W h 



■j. 1 



if i = j 



( 0_ 

W^iiA Bl Wijj 

[ e [-Wi.iiAs.WL^, Wi^Wl^] if A^. = 



if A^. > 
if A" 1 , < 



Thus we have 



tr((A 1 -A(S 1 ))Ar 1 



^iii.-A,,. ir,. ;; W^CSi)- 1 ^ 

A Bl \AiBx)- 1 



l,off 



The claim is proved if we show that for A := A(£?i), where A{B\) is obtained in Step 2, 



tr ((A(5i) - A*)A^ 
so that for Ai = A\ — A*, we have 



< 



V 

1 — 77 



A" 1 



(83) 



A_b x 



A" 



/ ^ 

m 

l,oflf 1 — 77 



A" 



/m < tr(A i A 1 



-j-tr ( (Ai - A(S 1 ))A r 1 ) + itr ( (A(S X ) - A^A^ 1 



< A Bl 



A" 1 



/to + ^ 

l.off 1—7/ 



A" 



/m. 
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It remains to show (83). Denote by A := A{B{) = W\T(Aq)Wx. Let diag(A*) 1/2 = diag( % /o^H, . . . ^/a^ r 
^diag(ylo) 1/2 - We have 

tr((2- A)^ 1 ) 

= tr ((Wif (A )W! - diag(^) 1 / 2 p(Ao)diag(^) 1 / 2 ) W^A~ X W^ 



tr (f^J- X; fti(^o)A^ 



E 1 ? ^w-^i^ 



v H i,. Wijj 

y/ a *,ii \/ a *,jj 
33 



where by (82a) and (82b), we have 



W lM W l 
Pij( A o) 



+ i 



(84) 



E^ 1 



1,M , 



< 



1 — 7/ 



and 



Pij( A o) 



diag^ 1 ) 



w 1M w 



1,33 



A 



^ E W ( A /,n(i + °w) + mMaod < T 

Clearly (83) holds by inserting the above inequalities in (84). □ 



V 



A.' 1 



l,ofl 



/m. 



P roof of Corollary G.5. Throughout this proof, we assume that event A\ holds. Let A Al = A{B\) 1 - 
p(A )~ 1 . We have for X Bl = 2rj/ei{\ - rj), where < E\ < 1, 



|diag(A Al )| 1 < v^||A Al || F 



< Vm9(l + £i)A Bl 



An 



0,oflf 



V 1 



o(y/m) 



where X Bl J \A 1 \ Q off V 1 = o(l) by (Al), given that \ Bl X = O(Ab ). Now for the non-diagonal 
part of Aa 1 , for < e\ < 1, we have by Corollary 4.4, 



|A Al 



l,oflf 



< 



< 



.4, 



-li 

lo,off 



1 + ei 18r? 



(1 - ei)ei 



l^>/|Vlo^V 1/(^(^0))) 
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where rrze^fe[ * s bounded so long as e\ is bounded away from and 1. Hence 



4" 1 ! 

A lo.oflF 



^0 | ,off 



I. off 



— lA^Ji = o(l/Vm) + o 
m \m 

= \p(A( j )~ 1 \ 1 off jm + o(l), and hence 

A{B x y x /m < \ P {Aq)-\ /m + \^ Al \ x /m = \p{A Q )-\ /m + o{l) 



where clearly 



< 1. We now insert the inequalities above in (81): 



m 



A- 1 



7] 1 

i,off ' 1 — rj m 



+ 



A' 



< A Bl \p(A )-\ oS + o (i^V)) 



+rr^(l' 3 ( yl o) _1 li/ ra + 



< 



\ — r\ 



in 



A 



lo,off 



+ m 



1—17 



p(A Q ) \ 1 /m + X Bl \p(A ) \ loS /m + o(\ Bl ) 



The other bounds follow from the fact that ?y < 7]. The corollary thus holds. □ 
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